Chapter 2. Creating Collections with Comprehensions

A list comprehension is a high-level, declarative way to create a list. It looks like this:

>>> squares = [ n*n for n in range(6) ]
>>> print(squares)
[0, 1, 4, 9, 16, 25]

This is essentially equivalent to the following:

>>> squares = []
>>> for n in range(6):
...     squares.append(n*n)
>>> print(squares)
[0, 1, 4, 9, 16, 25]

Notice that in the first example, what you type is declaring what kind of list you want, while the second is specifying how to create it. That’s why we say it is high-level and declarative: it’s as if you are stating what kind of list you want created, then letting Python figure out how to build it.

Python lets you write other kinds of comprehensions than lists. Here’s a simple dictionary comprehension, for example:

>>> blocks = { n: "x" * n for n in range(5) }
>>> print(blocks)
{0: '', 1: 'x', 2: 'xx', 3: 'xxx', 4: 'xxxx'}

This is equivalent to the following:

>>> blocks = dict()
>>> for n in range(5):
...     blocks[n] = "x" * n
>>> print(blocks)
{0: '', 1: 'x', 2: 'xx', 3: 'xxx', 4: 'xxxx'}

The main benefits of comprehensions are readability and maintainability. Most people find them very readable; even developers encountering a comprehension for the first time will usually find their first guess about what it means to be correct. You can’t get more readable than that.

And there is a deeper, cognitive benefit: once you’ve practiced with comprehensions a bit, you will find you can write them with very little mental effort—keeping more of your attention free for other tasks.

Beyond lists and dictionaries, there are several other forms of comprehension you will learn about in this chapter. As you become comfortable with them, you will find them to be versatile and very Pythonic—meaning, they fit well into many other Python idioms and constructs, lending new expressiveness and elegance to your code.

List Comprehensions

List comprehensions are the most widely used kind of comprehension and are essentially a way to create and populate a list. Their structure looks like this:

[ EXPRESSION for VARIABLE in SEQUENCE ]

EXPRESSION is any Python expression, though in useful comprehensions, the expression often has some variable in it. That variable is stated in the VARIABLE field. SEQUENCE defines the source values the variable enumerates through, creating the final sequence of calculated values.

Here’s the simple example we glimpsed earlier:

>>> squares = [ n*n for n in range(6) ]
>>> type(squares)
<class 'list'>
>>> print(squares)
[0, 1, 4, 9, 16, 25]

Notice the result is just a regular list. In squares, the expression is n*n; the variable is n; and the source sequence is range(6). The sequence is a range object; in fact, it can be any iterable: another list or tuple, a generator object, or something else.
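To make that last point concrete, here is a small sketch that uses three other kinds of iterable as the source: a tuple, a string, and a generator object. The countdown() function and the variable names are made up for this illustration.

```python
# A tuple as the source.
doubled = [n * 2 for n in (3, 5, 8)]
print(doubled)   # [6, 10, 16]

# A string is iterable too: the variable takes each character in turn.
codes = [ord(ch) for ch in "abc"]
print(codes)     # [97, 98, 99]

# A generator object, here produced by a small (hypothetical) generator function.
def countdown(start):
    while start > 0:
        yield start
        start -= 1

labels = ["T-minus " + str(n) for n in countdown(3)]
print(labels)    # ['T-minus 3', 'T-minus 2', 'T-minus 1']
```

In each case the result is still an ordinary list; only the source changes.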

The expression part can be anything that reduces to a value, including:

  • Arithmetic expressions like n+3

  • A function call like f(m), using m as the variable

  • A slice operation (like s[::-1], to reverse a string)

  • Method calls

Some complete examples:

>>> # First define some source sequences...
>>> pets = ["dog", "parakeet", "cat", "llama"]
>>> numbers = [ 9, -1, -4, 20, 11, -3 ]
>>> # And a helper function...
>>> def repeat(s):
...     return s + s
...
>>> # Now, some list comprehensions:
>>> [ 2*m+3 for m in range(10, 20, 2) ]
[23, 27, 31, 35, 39]
>>> [ abs(num) for num in numbers ]
[9, 1, 4, 20, 11, 3]
>>> [ 10 - x for x in numbers ]
[1, 11, 14, -10, -1, 13]
>>> [ pet.upper() for pet in pets ]
['DOG', 'PARAKEET', 'CAT', 'LLAMA']
>>> [ "The " + pet for pet in sorted(pets) ]
['The cat', 'The dog', 'The llama', 'The parakeet']
>>> [ repeat(pet) for pet in pets ]
['dogdog', 'parakeetparakeet', 'catcat', 'llamallama']

Notice how all these fit the same structure. They all have the keywords for and in; those are required in Python for any kind of comprehension you may write. These are interleaved among three fields: the expression, the variable (the identifier from which the expression is composed), and the source sequence.

The order of elements in the final list is determined by the order of the source sequence. You can filter out elements by adding an if clause:

>>> def is_palindrome(s):
...     return s == s[::-1]
...
>>> pets = ["dog", "parakeet", "cat", "llama"]
>>> numbers = [ 9, -1, -4, 20, 11, -3 ]
>>> words = ["bib", "bias", "dad", "eye", "deed", "tooth"]
>>>
>>> [ n*2 for n in numbers if n % 2 == 0 ]
[-8, 40]
>>>
>>> [pet.upper() for pet in pets if len(pet) == 3]
['DOG', 'CAT']
>>>
>>> [n for n in numbers if n > 0]
[9, 20, 11]
>>>
>>> [word for word in words if is_palindrome(word)]
['bib', 'dad', 'eye', 'deed']

The structure is

[ EXPR for VAR in SEQUENCE if CONDITION ]

where CONDITION is an expression that evaluates to True or False, depending on the variable.1 Note that it can be either a function applied to the variable (is_palindrome(word)), or a more complex expression. Choosing to use a function can improve readability, and also lets you apply filter logic whose code won’t fit on one line.

A list comprehension must always have the for keyword, even if the beginning expression is just the variable itself. For example:

>>> [word for word in words if is_palindrome(word)]
['bib', 'dad', 'eye', 'deed']

Sometimes people think word for word in words seems redundant (because it is), and try to shorten it. But that does not work:

>>> [word in words if is_palindrome(word)]
  File "<stdin>", line 1
    [word in words if is_palindrome(word)]
                                         ^
SyntaxError: invalid syntax

Formatting for Readability (and More)

Realistic list comprehensions tend to be too long to fit nicely on a single line. And they are composed of distinct logical parts, which can vary independently as the code evolves. This creates a couple of inconveniences, which are solved by a convenient fact: Python’s normal rules of whitespace are suspended inside the square brackets. You can exploit this to make them more readable and maintainable, splitting them across multiple lines:

def double_short_words(words):
    return [ word + word
             for word in words
             if len(word) < 5 ]

Another variation, which some people prefer:

def double_short_words(words):
    return [
        word + word
        for word in words
        if len(word) < 5
    ]

What I’ve done here is split the comprehension across separate lines. You can, and should, do this with any substantial comprehension. It’s great for several reasons, the most important being the instant gain in readability. This comprehension has three separate ideas expressed inside the square brackets: the expression (word + word); the sequence (for word in words); and the filtering clause (if len(word) < 5). These are logically separate aspects, and splitting them across different lines takes less cognitive effort for a human to read and understand than the one-line version. It’s effectively preparsed for you, as you read the code.

Splitting a comprehension over several lines has another benefit: it makes version control and code review diffs more pinpointed. Imagine you and I are on the same development team, working on this code base in different feature branches. In my branch, I change the expression to word * 2; in yours, you change the threshold to len(word) < 7. If the comprehension is on one line, version control tools will perceive this as a merge conflict, and whoever merges last will have to manually fix it.2 But since this list comprehension is split across three lines, our source control tool can automatically merge both our branches. And if we’re doing code reviews like we should be, the reviewer can identify the precise change immediately, without having to scan the line and think.

Multiple Sources and Filters

You can have several for VAR in SEQUENCE clauses. This lets you construct lists based on pairs, triplets, etc., from two or more source sequences:

>>> colors = ["orange", "purple", "pink"]
>>> toys = ["bike", "basketball", "skateboard", "doll"]
>>>
>>> [ color + " " + toy
...   for color in colors
...   for toy in toys ]
['orange bike', 'orange basketball', 'orange skateboard',
 'orange doll', 'purple bike', 'purple basketball',
 'purple skateboard', 'purple doll', 'pink bike',
 'pink basketball', 'pink skateboard', 'pink doll']

Every pair from the two sources, colors and toys, is used to calculate a value in the final list. That final list has 12 elements, the product of the lengths of the 2 source lists.

Notice the two for clauses are independent of each other; colors and toys are two unrelated lists. Using multiple for clauses can sometimes take a different form, where they are more interdependent. Consider this example:

>>> ranges = [range(1,7), range(4,12,3), range(-5,9,4)]
>>> [ float(num)
...   for subrange in ranges
...   for num in subrange ]
[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 4.0, 7.0, 10.0, -5.0,
-1.0, 3.0, 7.0]

The source sequence—ranges—is a list of range objects.3 Now, this list comprehension has two for clauses again. But notice one depends on the other. The source of the second is the variable for the first!

It’s not like the colorful-toys example, whose for clauses are independent of each other. When chained together this way, order matters:

>>> [ float(num)
...   for num in subrange
...   for subrange in ranges ]
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
NameError: name 'subrange' is not defined

Python evaluates the for clauses of a list comprehension from left to right. If the first clause is for num in subrange, at that point subrange is not defined. So you have to put for subrange in ranges first. You can chain more than two for clauses together like this; the first clause just needs to reference a previously defined source, and each later clause can use a variable defined by an earlier for clause, the way the second clause here uses subrange.
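Here is a minimal sketch of three chained for clauses, assuming a made-up nested list named teams. Each clause draws its source from the variable of the clause before it.

```python
# Hypothetical nested data: a list of teams, each a list of names.
teams = [["ann", "bo"], ["cy"], ["di", "ed"]]

letters = [
    letter
    for team in teams      # first clause: uses an existing name
    for name in team       # second clause: uses `team` from the clause above
    for letter in name     # third clause: uses `name` from the clause above
]
print("".join(letters))    # annbocydied
```

Reordering any of these clauses would reference a variable before it is defined, just as in the two-clause case.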

Independent Clauses

Now, that’s for chained for clauses. If the clauses are independent, does the order matter at all? It does, just in a different way. What’s the difference between these two list comprehensions?

>>> colors = ["orange", "purple", "pink"]
>>> toys = ["bike", "basketball", "skateboard", "doll"]
>>>
>>> [ color + " " + toy
...   for color in colors
...   for toy in toys ]
['orange bike', 'orange basketball', 'orange skateboard',
'orange doll', 'purple bike', 'purple basketball',
'purple skateboard', 'purple doll', 'pink bike',
'pink basketball', 'pink skateboard', 'pink doll']
>>>
>>> [ color + " " + toy
...   for toy in toys
...   for color in colors ]
['orange bike', 'purple bike', 'pink bike',
'orange basketball', 'purple basketball', 'pink basketball',
'orange skateboard', 'purple skateboard', 'pink skateboard',
'orange doll', 'purple doll', 'pink doll']

The order here doesn’t matter in the sense it does for chained for clauses, where you must put things in a certain order, or your program won’t run. Here, you have a choice. And that choice does affect the order of elements in the final list.

For both versions, the first element is “orange bike”. But the second element is different. Ask yourself: why? Why is the first element the same in both comprehensions? And why is the second element different?

It has to do with which sequence is held constant while the other varies. It’s the same logic that applies when nesting regular for loops:

>>> # Nested one way...
... build_colors_toys = []
>>> for color in colors:
...     for toy in toys:
...         build_colors_toys.append(color + " " + toy)
>>> build_colors_toys[0]
'orange bike'
>>> build_colors_toys[1]
'orange basketball'
>>>
>>> # And nested the other way.
... build_toys_colors = []
>>> for toy in toys:
...     for color in colors:
...         build_toys_colors.append(color + " " + toy)
>>> build_toys_colors[0]
'orange bike'
>>> build_toys_colors[1]
'purple bike'

The second for clause in the list comprehension corresponds to the inner for loop. Its values vary through their range more rapidly than those in the outer one.

Multiple Filters

In addition to using several for clauses, you can have more than one if clause, for multiple levels of filtering. Just write several of them in sequence:

>>> numbers = [ 9, -1, -4, 20, 17, -3 ]
>>> odd_positives = [
...     num for num in numbers
...     if num > 0
...     if num % 2 == 1
... ]
>>> print(odd_positives)
[9, 17]

Here, I’ve placed each if clause on its own line for readability, but I could have put both on one line. When you have more than one if clause, they are “and-ed” together, not “or-ed” together. The comprehension above is equivalent to this:

>>> numbers = [ 9, -1, -4, 20, 17, -3 ]
>>> odd_positives = [
...     num for num in numbers
...     if num > 0 and num % 2 == 1
... ]
>>> print(odd_positives)
[9, 17]

The only difference is readability. When you feel one if clause with an “and” will be more readable, do that; when you feel multiple if clauses will be more readable, do that.

What if you want to include elements matching at least one of the if-clause criteria, omitting only those not matching any? In that case, you must use a single if clause with an “or”. You cannot “or” multiple if clauses together inside a comprehension. For example, here’s how you can filter based on whether the number is a multiple of 2 or 3:

>>> numbers = [ 9, -1, -4, 20, 11, -3 ]
>>> [ num for num in numbers
...   if num % 2 == 0 or num % 3 == 0 ]
[9, -4, 20, -3]

You can also define a helper function. When your filtering logic is complex or non-obvious, this will often improve readability, and is worth considering:

>>> numbers = [ 9, -1, -4, 20, 11, -3 ]
>>> def num_is_valid(num):
...     return num % 2 == 0 or num % 3 == 0
...
>>> [ num for num in numbers
...   if num_is_valid(num) ]
[9, -4, 20, -3]

The comprehension mini-language is not as expressive as Python itself, and some lists cannot be expressed as a comprehension.
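One case that tends to resist a plain comprehension is a list in which each element depends on the one before it, such as a running total. A minimal sketch, assuming a sample list named deposits (hypothetical data): an ordinary loop expresses this naturally, because it can carry state from one iteration to the next.

```python
deposits = [100, 50, 25, 200]

running_totals = []
total = 0
for amount in deposits:
    total += amount              # state carried between iterations
    running_totals.append(total)

print(running_totals)   # [100, 150, 175, 375]
```

There are workarounds (the standard library's itertools.accumulate produces the same values, for instance), but when the logic needs memory of earlier elements, a regular loop is usually the clearer choice.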

You can use multiple for and if clauses together:

>>> weights = [0.2, 0.5, 0.9]
>>> values = [27.5, 13.4]
>>> offsets = [4.3, 7.1, 9.5]
>>>
>>> [ (weight, value, offset)
...   for weight in weights
...   for value in values
...   for offset in offsets
...   if offset > 5.0
...   if weight * value < offset ]
[(0.2, 27.5, 7.1), (0.2, 27.5, 9.5), (0.2, 13.4, 7.1),
(0.2, 13.4, 9.5), (0.5, 13.4, 7.1), (0.5, 13.4, 9.5)]

The only rule is that the first for clause must come before the first if clause. Other than that, you can interleave for and if clauses in any order, as long as each clause uses only variables defined to its left. Most people seem to find it more readable to group all the for clauses together first, then the if clauses together at the end.
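To see what interleaving looks like, here is a small sketch built on a list of range objects like the one used earlier in this chapter. The first if sits between the two for clauses and filters whole subranges; the second if filters individual numbers. The variable names are my own choices for the illustration.

```python
ranges = [range(1, 7), range(4, 12, 3), range(-5, 9, 4)]

odds_from_long_ranges = [
    num
    for subrange in ranges
    if len(subrange) > 3     # skips range(4, 12, 3), which has 3 elements
    for num in subrange
    if num % 2 == 1          # keeps only odd numbers
]
print(odds_from_long_ranges)   # [1, 3, 5, -5, -1, 3, 7]
```

Notice that 7 from range(4, 12, 3) never appears, even though it is odd: its whole subrange was filtered out before the inner for clause ran.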

Comprehensions and Generators

List comprehensions create lists:

>>> squares = [ n*n for n in range(6) ]
>>> type(squares)
<class 'list'>

When you need a list, that’s great, but sometimes you don’t need a list, and you’d prefer something which does not blow up your memory footprint. It’s like the situation near the start of Chapter 1:

# This again.
NUM_SQUARES = 10*1000*1000
many_squares = [ n*n for n in range(NUM_SQUARES) ]
for number in many_squares:
    do_something_with(number)

The entire many_squares list must be fully created—all memory for it must be allocated, and every element calculated—before do_something_with() is called even once. And memory usage goes through the roof.

You know one solution: write a generator function, and call it. But there’s an easier option: write a generator expression. This is the official name for it, but it really should be called a “generator comprehension”, in my humble but correct opinion. Syntactically, it looks like a list comprehension—except you use parentheses instead of square brackets:

>>> generated_squares = ( n*n for n in range(NUM_SQUARES) )
>>> type(generated_squares)
<class 'generator'>

This generator expression creates a generator object, in the exact same way a list comprehension creates a list. Any list comprehension you write, you can use to create an equivalent generator object, just by replacing “[” and “]” with “(” and “)”.

And you’re creating the object directly, without having to define a generator function to call. In other words, a generator expression is a convenient shortcut when you need a quick generator object:

# This...
many_squares = ( n*n for n in range(NUM_SQUARES) )

# ... is EXACTLY EQUIVALENT to this:
def gen_many_squares(limit):
    for n in range(limit):
        yield n * n
many_squares = gen_many_squares(NUM_SQUARES)

As far as Python is concerned, these two versions of many_squares are completely equivalent.
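To get a rough feel for the memory difference, here is a sketch comparing a list comprehension with the matching generator expression. It uses sys.getsizeof, which reports only the size of the container object itself (not the integers it refers to), and a smaller count than NUM_SQUARES so it runs quickly. The constant NUM and the variable names are assumptions for this example.

```python
import sys

NUM = 100_000   # smaller than NUM_SQUARES, to keep the demo quick

squares_list = [n * n for n in range(NUM)]
squares_gen = (n * n for n in range(NUM))

# The list holds a reference to every element up front; the generator
# object holds only the state needed to produce the next value.
list_bytes = sys.getsizeof(squares_list)
gen_bytes = sys.getsizeof(squares_gen)
print(list_bytes > gen_bytes)   # True

# Both still produce the same values when consumed.
same_total = sum(squares_gen) == sum(squares_list)
print(same_total)               # True
```

The exact byte counts vary by Python version and platform, which is why the sketch only compares them.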

Everything you know about list comprehensions applies to generator expressions: multiple for clauses, if clauses, etc. You only need to type the parentheses.

In fact, sometimes you can even omit them. When passing a generator expression as an argument to a function, you will sometimes find yourself typing (( followed by )). In that situation, Python lets you omit the inner pair.

Imagine, for example, you are sorting a list of customer email addresses, looking at only those customers whose status is “active”:

>>> # User is a class with "email" and "is_active" fields.
... # all_users is a list of User objects.

>>> # Sorted list of active users' email addresses.
... # Passing in a generator expression.
>>> sorted((user.email for user in all_users
...          if user.is_active))
['fred@a.com', 'sandy@f.net', 'tim@d.com']
>>>
>>> # Omitting the inner parentheses.
... # Still passing in a generator expression!
>>> sorted(user.email for user in all_users
...        if user.is_active)
['fred@a.com', 'sandy@f.net', 'tim@d.com']

Notice how readable and natural this is (or will be, once you’ve practiced a bit). One thing to watch out for: you can only inline a generator expression this way when it is the only argument in the function or method call. Otherwise, you get a syntax error:

>>>
>>> # Reverse that list. Whoops...
... sorted(user.email for user in all_users
...         if user.is_active, reverse=True)
  File "<stdin>", line 2
SyntaxError: Generator expression must be parenthesized if not sole argument

Python cannot interpret what you mean here, because it is ambiguous in Python’s grammar. So you must use the inner parentheses:

>>> # Okay, THIS will get the reversed list.
... sorted((user.email for user in all_users
...         if user.is_active), reverse=True)
['tim@d.com', 'sandy@f.net', 'fred@a.com']

Sometimes it is more readable to assign the generator expression to a variable:

>>> active_emails = (
...        user.email for user in all_users
...        if user.is_active
... )

>>> sorted(active_emails, reverse=True)
['tim@d.com', 'sandy@f.net', 'fred@a.com']

Generator expressions without parentheses suggest a unified way of thinking about comprehensions, that links generator expressions and list comprehensions together. Here’s a generator expression for a sequence of squares:

( n**2 for n in range(10) )

Here it is again, passed to the built-in list() function:

list( n**2 for n in range(10) )

And here it is as a list comprehension:

[ n**2 for n in range(10) ]

When you understand generator expressions, it’s easy to see list comprehensions as a derivative data structure. The same applies for dictionary and set comprehensions (covered next). Even though Python does not work that way internally, this mental model is fully consistent with Python’s semantics.
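A quick sketch of that mental model: passing a generator expression to list(), set(), or dict() gives the same result as the corresponding comprehension. The variable names here are just for illustration.

```python
# Lists: generator expression + list() versus a list comprehension.
via_list_call = list(n**2 for n in range(10))
via_comprehension = [n**2 for n in range(10)]
lists_match = via_list_call == via_comprehension
print(lists_match)   # True

# The same pattern holds for sets and dictionaries.
sets_match = set(n % 3 for n in range(10)) == {n % 3 for n in range(10)}
dicts_match = dict((n, n**2) for n in range(4)) == {n: n**2 for n in range(4)}
print(sets_match, dicts_match)   # True True
```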

With this insight, you start seeing new opportunities to use all these comprehension forms in your own code—improving readability, maintainability, and performance in the process.

If generator expressions are so great, why would you ever use list comprehensions? Generally speaking, your code will be more scalable and responsive if you use a generator expression. Except, of course, when you actually need a list. There are several considerations.

First, if the sequence is unlikely to be very big—and by “big”, I mean a minimum of thousands of elements long—you probably won’t benefit from using a generator expression. That’s just not big enough for the memory footprint to matter.

Next, generator expressions do not always fit the usage pattern you need. If you need random access, or to go through the sequence twice, generator expressions won’t work. Generator expressions also won’t work if you need to append or remove elements, or change the value at some index so that you can look it up later.
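A minimal sketch of those limitations, using small made-up sequences: a generator object can be consumed only once, and it does not support indexing, while a list handles both.

```python
squares_gen = (n * n for n in range(5))

first_pass = list(squares_gen)
second_pass = list(squares_gen)   # the generator is already exhausted
print(first_pass)    # [0, 1, 4, 9, 16]
print(second_pass)   # []

# Indexing a generator object raises TypeError.
another_gen = (n * n for n in range(5))
try:
    another_gen[2]
    indexing_worked = True
except TypeError:
    indexing_worked = False
print(indexing_worked)   # False

# A list supports random access and repeated passes.
squares_list = [n * n for n in range(5)]
print(squares_list[2])   # 4
```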

This is especially important when writing a method or function whose return value is a sequence. Do you return a generator expression, or a list comprehension?

In theory, there’s no reason to ever return a list instead of a generator object; the caller can turn a generator object into a list just by passing it to list(). In practice, the interface may be such that the caller will want an actual list; forcing them to deal with a generator object will just get in the way. Also, if you are constructing the return value as a list within the function, it’s silly to return a generator expression over it—just return the actual list.
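A small sketch of that trade-off, using hypothetical (email, is_active) tuples rather than real User objects. The first function hands back a generator object and leaves the choice to the caller; the second already has a list in hand and simply returns it. Both function names are invented for this example.

```python
def iter_active_emails(users):
    # Returns a generator object; nothing is computed until it is consumed.
    return (email for email, is_active in users if is_active)

def sorted_active_emails(users):
    # sorted() already builds a list, so return that list as-is.
    return sorted(email for email, is_active in users if is_active)

users = [("tim@d.com", True), ("pat@b.org", False), ("fred@a.com", True)]

lazy = iter_active_emails(users)
as_list = list(lazy)                 # the caller converts when a list is needed
print(as_list)                       # ['tim@d.com', 'fred@a.com']
print(sorted_active_emails(users))   # ['fred@a.com', 'tim@d.com']
```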

If your intention is to create a library usable by people who may not be advanced Pythonistas, that can be an argument for returning lists. Almost all programmers are familiar with list-like data structures. But fewer are familiar with how generators work in Python, and may—quite reasonably—get confused when confronted with a generator object.

Dictionaries, Sets, and Tuples

Just like a list comprehension creates a list, a dictionary comprehension creates a dictionary. You saw an example at the beginning of this chapter; here’s another. Suppose you have this Student class:

class Student:
    def __init__(self, name, gpa, major):
        self.name = name
        self.gpa = gpa
        self.major = major

Given a list named students, containing Student instances, we can write a dictionary comprehension mapping student names to their GPAs:

>>> { student.name: student.gpa for student in students }
{'Jim Smith': 3.6, 'Ryan Spencer': 3.1,
 'Penny Gilmore': 3.9, 'Alisha Jones': 2.5,
 'Todd Reynolds': 3.4}

The syntax differs from that of list comprehensions in two ways. Instead of square brackets, you’re using curly braces—which makes sense, since this creates a dictionary. The other difference is the expression field, whose format is “key: value”, since a dict has key-value pairs. So the structure is:

{ KEY : VALUE for VARIABLE in SEQUENCE }

These are the only differences. Everything else you learned about list comprehensions applies, including filtering with if clauses:

>>> def invert_name(name):
...     first, last = name.split(" ", 1)
...     return last + ", " + first
...
>>> # Get "lastname, firstname" of high-GPA students.
... { invert_name(student.name): student.gpa
...   for student in students
...   if student.gpa > 3.5 }
{'Smith, Jim': 3.6, 'Gilmore, Penny': 3.9}

You can create sets too. Set comprehensions look exactly like list comprehensions, but with curly braces instead of square brackets:

>>> # A list of student majors...
... [ student.major for student in students ]
['Computer Science', 'Economics', 'Computer Science',
 'Economics', 'Basket Weaving']
>>> # And the same as a set:
... { student.major for student in students }
{'Economics', 'Computer Science', 'Basket Weaving'}
>>> # You can also use the set() built-in.
... set(student.major for student in students)
{'Economics', 'Computer Science', 'Basket Weaving'}

(How does Python distinguish between a set and dict comprehension? A dict comprehension’s expression is a key-value pair, while a set comprehension’s is a single value.)
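A short sketch of the distinction, using a made-up words list. It also shows a related point worth remembering: empty curly braces create a dict, so an empty set has to be written as set().

```python
words = ["bib", "dad", "eye", "deed"]

lengths_set = {len(word) for word in words}          # single value: a set
lengths_dict = {word: len(word) for word in words}   # key: value pair: a dict

print(type(lengths_set))    # <class 'set'>
print(type(lengths_dict))   # <class 'dict'>

# Empty braces are a dict, not a set.
print(type({}))             # <class 'dict'>
print(type(set()))          # <class 'set'>
```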

What about tuple comprehensions? This is fun: strictly speaking, Python doesn’t support them. However, you can pretend it does by using tuple():

>>> tuple(student.gpa for student in students
...       if student.major == "Computer Science")
(3.6, 2.5)

This creates a tuple, but it’s not a tuple comprehension. You’re calling the tuple constructor, and passing it a single argument. What’s that argument? A generator expression! In other words, you’re doing this:

>>> cs_students = (
...     student.gpa for student in students
...     if student.major == "Computer Science"
...     )
>>> type(cs_students)
<class 'generator'>
>>> tuple(cs_students)
(3.6, 2.5)
>>>
>>> # Same as:
... tuple((student.gpa for student in students
...        if student.major == "Computer Science"))
(3.6, 2.5)
>>> # But you can omit the inner parentheses.

The tuple constructor takes an iterable as an argument. cs_students is a generator object (created by the generator expression), and a generator object is an iterator, which is itself iterable. So you can pretend Python has tuple comprehensions, using “tuple(” as the opener and “)” as the close. In fact, this also gives you alternate ways to build dictionaries and sets:

>>> # Same as:
... # { student.name: student.gpa for student in students }
>>> dict((student.name, student.gpa)
...      for student in students)
{'Jim Smith': 3.6, 'Penny Gilmore': 3.9,
 'Alisha Jones': 2.5, 'Ryan Spencer': 3.1,
 'Todd Reynolds': 3.4}
>>> # Same as:
... # { student.major for student in students }
>>> set(student.major for student in students)
{'Computer Science', 'Basket Weaving', 'Economics'}

Remember, when you pass a generator expression into a function, you can omit the inner parentheses. That’s why you can, for example, type

tuple(f(x) for x in numbers)

instead of

tuple((f(x) for x in numbers))

One last point. Generator expressions are a scalable analog of list comprehensions; is there any such equivalent for dicts, or for sets? No, but you can still construct generator expressions and pass the resulting generator object to their constructor, much like you did with tuple.

For dict, you will want the yielded elements to be (key, value) tuples. For sets, it is maximally efficient to code that generator expression to only yield unique values. But that is not always worth the trouble; if duplicates are generated, the set constructor will handle it fine.
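Here is a minimal sketch of both points, assuming a hypothetical list of (name, major) records in place of real Student objects. The dict constructor consumes (key, value) tuples, and the set constructor quietly absorbs duplicates.

```python
# Hypothetical (name, major) records, for illustration only.
records = [("Jim", "CS"), ("Ryan", "Econ"), ("Penny", "CS")]

# dict() consumes (key, value) tuples yielded by the generator expression.
major_by_name = dict((name, major) for name, major in records)
print(major_by_name["Penny"])   # CS

# set() discards duplicates: "CS" is yielded twice but stored once.
majors = set(major for name, major in records)
print(len(majors))              # 2
```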

Conclusion

Comprehensions are a useful tool for readable, maintainable Python. Their sensible succinctness and high-level, declarative nature make them easy to write, easy to read, and easy to maintain. Use them more in your code, and you will find your Python experience greatly improved.

1 Technically, the condition does not have to depend on the variable. But useful examples of this are extremely rare.

2 I like to think future version control tools will automatically resolve this kind of situation. I believe it will require the tool to have knowledge of the language grammar, so it can parse and reason about different clauses in a line of code.

3 Refresher: The range() built-in returns a range object, an iterable sequence of integers, and can be called with 1, 2, or 3 arguments. The most general form is range(start, stop, step), beginning at start, going up to but not including stop, in increments of step. Called with two arguments, the step-size defaults to 1; with one argument, that argument is the stop, and the sequence starts at 0.
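A quick sketch of the three calling forms described in this footnote, wrapping each in list() so the values are visible. The variable names are only for illustration.

```python
one_arg = list(range(5))           # stop only; starts at 0
two_args = list(range(2, 5))       # start and stop; step defaults to 1
three_args = list(range(4, 12, 3)) # start, stop, and step

print(one_arg)      # [0, 1, 2, 3, 4]
print(two_args)     # [2, 3, 4]
print(three_args)   # [4, 7, 10]
```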