A list comprehension is a high-level, declarative way to create a list. It looks like this:
>>> squares = [n*n for n in range(6)]
>>> print(squares)
[0, 1, 4, 9, 16, 25]
This is essentially equivalent to the following:
>>> squares = []
>>> for n in range(6):
...     squares.append(n*n)
...
>>> print(squares)
[0, 1, 4, 9, 16, 25]
Notice that in the first example, what you type is declaring what kind of list you want, while the second is specifying how to create it. That’s why we say it is high-level and declarative: it’s as if you are stating what kind of list you want created, then letting Python figure out how to build it.
Python lets you write other kinds of comprehensions than lists. Here’s a simple dictionary comprehension, for example:
>>> blocks = {n: "x" * n for n in range(5)}
>>> print(blocks)
{0: '', 1: 'x', 2: 'xx', 3: 'xxx', 4: 'xxxx'}
This is equivalent to the following:
>>> blocks = dict()
>>> for n in range(5):
...     blocks[n] = "x" * n
...
>>> print(blocks)
{0: '', 1: 'x', 2: 'xx', 3: 'xxx', 4: 'xxxx'}
The main benefits of comprehensions are readability and maintainability. Most people find them very readable; even developers encountering a comprehension for the first time will usually find their first guess about what it means to be correct. You can’t get more readable than that.
And there is a deeper, cognitive benefit: once you’ve practiced with comprehensions a bit, you will find you can write them with very little mental effort—keeping more of your attention free for other tasks.
Beyond lists and dictionaries, there are several other forms of comprehension you will learn about in this chapter. As you become comfortable with them, you will find them to be versatile and very Pythonic—meaning, they fit well into many other Python idioms and constructs, lending new expressiveness and elegance to your code.
List comprehensions are the most widely used kind of comprehension and are essentially a way to create and populate a list. Their structure looks like this:
[ EXPRESSION for VARIABLE in SEQUENCE ]
EXPRESSION is any Python expression, though in useful
comprehensions, the expression often has some variable in it. That
variable is stated in the VARIABLE field. SEQUENCE defines the
source values the variable enumerates through, creating the final
sequence of calculated values.
Here’s the simple example we glimpsed earlier:
>>> squares = [n*n for n in range(6)]
>>> type(squares)
<class 'list'>
>>> print(squares)
[0, 1, 4, 9, 16, 25]
Notice the result is just a regular list. In squares, the expression
is n*n; the variable is n; and the source sequence is range(6).
The sequence is a range object; in fact, it can be any
iterable: another list or tuple, a generator object, or something
else.
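To make the "any iterable" point concrete, here is a small sketch that feeds a tuple, a string, and a generator object to the same kind of comprehension. The countdown() helper and all the sample values are invented for this illustration; they are not part of the chapter's running examples.

```python
# Any iterable can serve as the source of a list comprehension.
# All names and sample values below are made up for illustration.

def countdown(start):
    """A tiny generator function yielding start, start-1, ..., 1."""
    while start > 0:
        yield start
        start -= 1

from_tuple = [n * 10 for n in (1, 2, 3)]        # source is a tuple
from_string = [ch.upper() for ch in "abc"]      # source is a string
from_generator = [n * n for n in countdown(4)]  # source is a generator object

print(from_tuple)      # [10, 20, 30]
print(from_string)     # ['A', 'B', 'C']
print(from_generator)  # [16, 9, 4, 1]
```

In each case the comprehension simply asks the source for its next item until there are none left, so the result is always a regular list.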
The expression part can be anything that reduces to a value, including:
Arithmetic expressions like n+3
A function call like f(m), using m as the variable
A slice operation (like s[::-1], to reverse a string)
Method calls
Some complete examples:
>>> # First define some source sequences...
>>> pets = ["dog", "parakeet", "cat", "llama"]
>>> numbers = [9, -1, -4, 20, 11, -3]
>>> # And a helper function...
>>> def repeat(s):
...     return s + s
...
>>> # Now, some list comprehensions:
>>> [2*m+3 for m in range(10, 20, 2)]
[23, 27, 31, 35, 39]
>>> [abs(num) for num in numbers]
[9, 1, 4, 20, 11, 3]
>>> [10 - x for x in numbers]
[1, 11, 14, -10, -1, 13]
>>> [pet.upper() for pet in pets]
['DOG', 'PARAKEET', 'CAT', 'LLAMA']
>>> ["The " + pet for pet in sorted(pets)]
['The cat', 'The dog', 'The llama', 'The parakeet']
>>> [repeat(pet) for pet in pets]
['dogdog', 'parakeetparakeet', 'catcat', 'llamallama']
Notice how all these fit the same structure. They all have the
keywords for and in; those are required in Python for any
kind of comprehension you may write. These are interleaved among three
fields: the expression, the variable (the identifier from which
the expression is composed), and the source sequence.
The order of elements in the final list is determined by the order of
the source sequence. You can filter out elements by adding an if
clause:
>>> def is_palindrome(s):
...     return s == s[::-1]
...
>>> pets = ["dog", "parakeet", "cat", "llama"]
>>> numbers = [9, -1, -4, 20, 11, -3]
>>> words = ["bib", "bias", "dad", "eye", "deed", "tooth"]
>>>
>>> [n*2 for n in numbers if n % 2 == 0]
[-8, 40]
>>>
>>> [pet.upper() for pet in pets if len(pet) == 3]
['DOG', 'CAT']
>>>
>>> [n for n in numbers if n > 0]
[9, 20, 11]
>>>
>>> [word for word in words if is_palindrome(word)]
['bib', 'dad', 'eye', 'deed']
The structure is
[ EXPR for VAR in SEQUENCE if CONDITION ]
where CONDITION is an expression that evaluates to True or
False, depending on the variable.1 Note that it can be either a function
applied to the variable (is_palindrome(word)), or a more complex
expression. Choosing to use a function can improve readability, and
also let you apply filter logic whose code won’t fit on one line.
A list comprehension must always have the for keyword, even if the
beginning expression is just the variable itself. For example:
>>> [word for word in words if is_palindrome(word)]
['bib', 'dad', 'eye', 'deed']
Sometimes people think word for word in words seems redundant
(because it is), and try to shorten it. But that does not work:
>>> [word in words if is_palindrome(word)]
  File "<stdin>", line 1
    [word in words if is_palindrome(word)]
                                         ^
SyntaxError: invalid syntax
Realistic list comprehensions tend to be too long to fit nicely on a single line. And they are composed of distinct logical parts, which can vary independently as the code evolves. This creates a couple of inconveniences, which are solved by a convenient fact: Python’s normal rules of whitespace are suspended inside the square brackets. You can exploit this to make them more readable and maintainable, splitting them across multiple lines:
def double_short_words(words):
    return [word + word
            for word in words
            if len(word) < 5]
Another variation, which some people prefer:
def double_short_words(words):
    return [
        word + word
        for word in words
        if len(word) < 5
    ]
What I’ve done here is split the comprehension across separate
lines. You can, and should, do this with any substantial
comprehension. It’s great for several reasons, the most important
being the instant gain in readability. This comprehension has three
separate ideas expressed inside the square brackets: the expression
(word + word); the sequence (for word in words); and the filtering
clause (if len(word) < 5). These are logically separate aspects, and
splitting them across different lines takes less cognitive
effort for a human to read and understand than the one-line
version. It’s effectively preparsed for you, as you read the code.
Splitting a comprehension over several lines has another benefit: it
makes version control and code review diffs more pinpointed. Imagine
you and I are on the same development team, working on this code base
in different feature branches. In my branch, I change the expression
to word * 2; in yours, you change the threshold to len(word) <
7. If the comprehension is on one line, version control tools will
perceive this as a merge conflict, and whoever merges last will have
to manually fix it.2 But
since this list comprehension is split across three lines, our source
control tool can automatically merge both our branches. And if we’re
doing code reviews like we should be, the reviewer can identify the
precise change immediately, without having to scan the line and think.
You can have several for VAR in SEQUENCE clauses. This lets you
construct lists based on pairs, triplets, etc., from two or more
source sequences:
>>> colors = ["orange", "purple", "pink"]
>>> toys = ["bike", "basketball", "skateboard", "doll"]
>>>
>>> [color + " " + toy
...  for color in colors
...  for toy in toys]
['orange bike', 'orange basketball', 'orange skateboard',
 'orange doll', 'purple bike', 'purple basketball',
 'purple skateboard', 'purple doll', 'pink bike',
 'pink basketball', 'pink skateboard', 'pink doll']
Every pair from the two sources, colors and toys, is used to
calculate a value in the final list. That final list has 12 elements,
the product of the lengths of the 2 source lists.
Notice the two for clauses are independent of each other; colors
and toys are two unrelated lists. Using multiple for clauses can
sometimes take a different form, where they are more
interdependent. Consider this example:
>>> ranges = [range(1, 7), range(4, 12, 3), range(-5, 9, 4)]
>>> [float(num)
...  for subrange in ranges
...  for num in subrange]
[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 4.0, 7.0, 10.0, -5.0,
 -1.0, 3.0, 7.0]
The source sequence—ranges—is a list of range
objects.3 Now, this list
comprehension has two for clauses again. But notice one depends on
the other. The source of the second is the variable for the first!
It’s not like the colorful-toys example, whose for clauses are
independent of each other. When chained together this way, order
matters:
>>> [float(num)
...  for num in subrange
...  for subrange in ranges]
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
NameError: name 'subrange' is not defined
Python evaluates the for clauses from left to right. If the first
clause is for num in subrange, at that point subrange is not
defined. So you have to put for subrange in ranges first. You can
chain more than two for clauses together like this; the first clause
just needs to reference a previously defined source, and each later
clause can use variables defined by the clauses before it, the way
the second clause uses subrange here.
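Here is a minimal sketch of chaining three for clauses, where each clause draws on the variable defined by the one before it. The nested tables structure is a made-up example, not something from earlier in the chapter.

```python
# A hypothetical nested structure: a list of tables, each table a list
# of rows, and each row a list of numbers.
tables = [
    [[1, 2], [3]],
    [[4], [5, 6]],
]

flattened = [
    cell
    for table in tables   # first clause: a previously defined source
    for row in table      # second clause: uses `table` from the first
    for cell in row       # third clause: uses `row` from the second
]

print(flattened)  # [1, 2, 3, 4, 5, 6]
```

Reordering these clauses would raise a NameError for the same reason as before: a clause cannot use a variable that no earlier clause has defined.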
Now, that’s for chained for clauses. If the clauses are independent,
does the order matter at all? It does, just in a different way. What’s
the difference between these two list comprehensions?
>>> colors = ["orange", "purple", "pink"]
>>> toys = ["bike", "basketball", "skateboard", "doll"]
>>>
>>> [color + " " + toy
...  for color in colors
...  for toy in toys]
['orange bike', 'orange basketball', 'orange skateboard',
 'orange doll', 'purple bike', 'purple basketball',
 'purple skateboard', 'purple doll', 'pink bike',
 'pink basketball', 'pink skateboard', 'pink doll']
>>>
>>> [color + " " + toy
...  for toy in toys
...  for color in colors]
['orange bike', 'purple bike', 'pink bike', 'orange
 basketball', 'purple basketball', 'pink basketball',
 'orange skateboard', 'purple skateboard', 'pink
 skateboard', 'orange doll', 'purple doll', 'pink doll']
The order here doesn’t matter in the sense it does for chained for
clauses, where you must put things in a certain order, or your
program won’t run. Here, you have a choice. And that choice does
affect the order of elements in the resulting list.
For both versions, the first element is “orange bike”. But the second element is different. Ask yourself: why? Why is the first element the same in both comprehensions? And why is the second element different?
It has to do with which sequence is held constant while the other
varies. It’s the same logic that applies when nesting regular for
loops:
>>> # Nested one way...
... build_colors_toys = []
>>> for color in colors:
...     for toy in toys:
...         build_colors_toys.append(color + " " + toy)
...
>>> build_colors_toys[0]
'orange bike'
>>> build_colors_toys[1]
'orange basketball'
>>>
>>> # And nested the other way.
... build_toys_colors = []
>>> for toy in toys:
...     for color in colors:
...         build_toys_colors.append(color + " " + toy)
...
>>> build_toys_colors[0]
'orange bike'
>>> build_toys_colors[1]
'purple bike'
The second for clause in the list comprehension corresponds to the
inner for loop. Its values vary through their range more rapidly
than those in the outer one.
In addition to using several for clauses, you can have more than one
if clause, for multiple levels of filtering. Just write several of
them in sequence:
>>> numbers = [9, -1, -4, 20, 17, -3]
>>> odd_positives = [
...     num for num in numbers
...     if num > 0
...     if num % 2 == 1
... ]
>>> print(odd_positives)
[9, 17]
Here, I’ve placed each if clause on its own line for readability, but I could have put both on one line. When you have
more than one if clause, they are “and-ed” together, not
“or-ed” together. The result is equivalent to this:
>>> numbers = [9, -1, -4, 20, 17, -3]
>>> odd_positives = [
...     num for num in numbers
...     if num > 0 and num % 2 == 1
... ]
>>> print(odd_positives)
[9, 17]
The only difference is readability. When you feel one if clause with
an “and” will be more readable, do that; when you feel multiple if
clauses will be more readable, do that.
What if you want to include elements matching at least one of the
if-clause criteria, omitting only those not matching any? In that
case, you must use a single if clause with an “or”. You cannot “or”
multiple if clauses together inside a comprehension. For example,
here’s how you can filter based on whether the number is a multiple of
2 or 3:
>>> numbers = [9, -1, -4, 20, 11, -3]
>>> [num for num in numbers
...  if num % 2 == 0 or num % 3 == 0]
[9, -4, 20, -3]
You can also define a helper function. When your filtering logic is complex or non-obvious, this will often improve readability, and is worth considering:
>>> numbers = [9, -1, -4, 20, 11, -3]
>>> def num_is_valid(num):
...     return num % 2 == 0 or num % 3 == 0
...
>>> [num for num in numbers
...  if num_is_valid(num)]
[9, -4, 20, -3]
The comprehension mini-language is not as expressive as Python itself, and some lists cannot be expressed as a comprehension.
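As one illustration of that limit (my own example, not one the chapter names), consider a list where each element depends on the one before it and the final length is not known in advance. That shape does not map naturally onto the for and if clauses of a comprehension, so a plain loop is the straightforward choice. The collatz_steps() function below is a hypothetical helper written for this sketch.

```python
def collatz_steps(start):
    """Build the Collatz sequence from start down to 1 with a plain loop.

    Each new element is computed from the previous one, and the loop
    stops on a condition about the data rather than on a source
    sequence running out.
    """
    steps = [start]
    while steps[-1] != 1:
        last = steps[-1]
        steps.append(last // 2 if last % 2 == 0 else 3 * last + 1)
    return steps

print(collatz_steps(6))  # [6, 3, 10, 5, 16, 8, 4, 2, 1]
```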
You can use multiple for and if clauses together:
>>> weights = [0.2, 0.5, 0.9]
>>> values = [27.5, 13.4]
>>> offsets = [4.3, 7.1, 9.5]
>>>
>>> [(weight, value, offset)
...  for weight in weights
...  for value in values
...  for offset in offsets
...  if offset > 5.0
...  if weight * value < offset]
[(0.2, 27.5, 7.1), (0.2, 27.5, 9.5), (0.2, 13.4, 7.1),
 (0.2, 13.4, 9.5), (0.5, 13.4, 7.1), (0.5, 13.4, 9.5)]
The only rule is that the first for clause must come before the
first if clause. Other than that, you can interleave for and if
clauses in any order. Most people seem to find it more readable
to group all the for clauses together at first, then the if
clauses together at the end.
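A short sketch of what interleaving looks like in practice: the two comprehensions below apply the same filters, once grouped at the end and once with an if clause placed between the for clauses. The letters list is invented for this example.

```python
numbers = [9, -1, -4, 20, 11, -3]
letters = ["a", "b"]  # made-up second source, just for this sketch

# All for clauses first, then all if clauses.
grouped = [
    (num, letter)
    for num in numbers
    for letter in letters
    if num > 0
    if letter == "a"
]

# The same logic, with an if clause between the two for clauses.
interleaved = [
    (num, letter)
    for num in numbers
    if num > 0
    for letter in letters
    if letter == "a"
]

print(grouped)                 # [(9, 'a'), (20, 'a'), (11, 'a')]
print(grouped == interleaved)  # True
```

Both produce the same list. In the interleaved form, the num > 0 test runs before the inner clause starts, so the letters are never iterated for a rejected number.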
List comprehensions create lists:
>>> squares = [n*n for n in range(6)]
>>> type(squares)
<class 'list'>
When you need a list, that’s great, but sometimes you don’t need a list, and you’d prefer something which does not blow up your memory footprint. It’s like the situation near the start of Chapter 1:
# This again.
NUM_SQUARES = 10*1000*1000
many_squares = [n*n for n in range(NUM_SQUARES)]

for number in many_squares:
    do_something_with(number)
The entire many_squares list must be fully created—all memory for
it must be allocated, and every element calculated—before
do_something_with() is called even once. And memory usage goes
through the roof.
You know one solution: write a generator function, and call it. But there’s an easier option: write a generator expression. This is the official name for it, but it really should be called a “generator comprehension”, in my humble but correct opinion. Syntactically, it looks like a list comprehension—except you use parentheses instead of square brackets:
>>> generated_squares = (n*n for n in range(NUM_SQUARES))
>>> type(generated_squares)
<class 'generator'>
This generator expression creates a generator object, in the exact
same way a list comprehension creates a list. Any list comprehension
you write, you can use to create an equivalent generator object,
just by swapping “[” and “]” for “(” and “)”.
And you’re creating the object directly, without having to define a generator function to call. In other words, a generator expression is a convenient shortcut when you need a quick generator object:
# This...
many_squares = (n*n for n in range(NUM_SQUARES))

# ... is EXACTLY EQUIVALENT to this:
def gen_many_squares(limit):
    for n in range(limit):
        yield n * n
many_squares = gen_many_squares(NUM_SQUARES)
As far as Python is concerned, these two versions of many_squares
are completely equivalent.
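If you want to see the memory difference for yourself, here is a rough sketch using sys.getsizeof(). Two assumptions to keep in mind: I use a smaller NUM_SQUARES than the earlier example so it runs quickly, and sys.getsizeof() measures only the container object itself, not the integers it refers to, so treat the comparison as indicative rather than exact.

```python
import sys

NUM_SQUARES = 100_000  # smaller than the earlier example, to keep this quick

squares_list = [n * n for n in range(NUM_SQUARES)]
squares_gen = (n * n for n in range(NUM_SQUARES))

# Size of each object itself; exact numbers vary by Python version.
list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
print(list_size > gen_size)  # True

# The generator computes each value only when asked for it.
first_value = next(squares_gen)
second_value = next(squares_gen)
print(first_value, second_value)  # 0 1
```

The list has to hold a slot for every element up front, while the generator object stays small no matter how large NUM_SQUARES gets.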
Everything you know about list comprehensions applies to generator
expressions: multiple for clauses, if clauses, etc. You only need
to type the parentheses.
In fact, sometimes you can even omit them. When passing a generator
expression as an argument to a function, you will sometimes find
yourself typing (( followed by )). In that situation, Python
lets you omit the inner pair.
Imagine, for example, you are sorting a list of customer email addresses, looking at only those customers whose status is “active”:
>>> # User is a class with "email" and "is_active" fields.
... # all_users is a list of User objects.
>>> # Sorted list of active users' email addresses.
... # Passing in a generator expression.
>>> sorted((user.email for user in all_users
...         if user.is_active))
['fred@a.com', 'sandy@f.net', 'tim@d.com']
>>>
>>> # Omitting the inner parentheses.
... # Still passing in a generator expression!
>>> sorted(user.email for user in all_users
...        if user.is_active)
['fred@a.com', 'sandy@f.net', 'tim@d.com']
Notice how readable and natural this is (or will be, once you’ve practiced a bit). One thing to watch out for: you can only omit the inner parentheses this way when the generator expression is the sole argument in the call. Otherwise, you get a syntax error:
>>>
>>> # Reverse that list. Whoops...
... sorted(user.email for user in all_users
...        if user.is_active, reverse=True)
  File "<stdin>", line 2
SyntaxError: Generator expression must be parenthesized if not sole argument
Python cannot interpret what you mean here, because it is ambiguous in Python’s grammar. So you must use the inner parentheses:
>>> # Okay, THIS will get the reversed list.
... sorted((user.email for user in all_users
...         if user.is_active), reverse=True)
['tim@d.com', 'sandy@f.net', 'fred@a.com']
Sometimes it is more readable to assign the generator expression to a variable:
>>> active_emails = (
...     user.email for user in all_users
...     if user.is_active
... )
>>> sorted(active_emails, reverse=True)
['tim@d.com', 'sandy@f.net', 'fred@a.com']
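To try these calls yourself, you need a User class and an all_users list, which the chapter describes but does not define. Below is one possible self-contained setup. The dataclass definition and the inactive bob@x.org entry are my own assumptions for the sketch; the other addresses are the ones shown in the output above.

```python
from dataclasses import dataclass

@dataclass
class User:
    # A stand-in for the User class described in the text.
    email: str
    is_active: bool

all_users = [
    User("tim@d.com", True),
    User("bob@x.org", False),   # hypothetical inactive user
    User("fred@a.com", True),
    User("sandy@f.net", True),
]

# Sole argument: the inner parentheses can be omitted.
ascending = sorted(user.email for user in all_users if user.is_active)

# A second argument is present: the generator expression needs its own parentheses.
descending = sorted(
    (user.email for user in all_users if user.is_active),
    reverse=True,
)

print(ascending)   # ['fred@a.com', 'sandy@f.net', 'tim@d.com']
print(descending)  # ['tim@d.com', 'sandy@f.net', 'fred@a.com']
```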
Generator expressions without parentheses suggest a unified way of thinking about comprehensions, that links generator expressions and list comprehensions together. Here’s a generator expression for a sequence of squares:
(n**2 for n in range(10))
Here it is again, passed to the built-in list() function:
list(n**2 for n in range(10))
And here it is as a list comprehension:
[n**2 for n in range(10)]
When you understand generator expressions, it’s easy to see list comprehensions as a derivative data structure. The same applies for dictionary and set comprehensions (covered next). Even though Python does not work that way internally, this mental model is fully consistent with Python’s semantics.
With this insight, you start seeing new opportunities to use all these comprehension forms in your own code—improving readability, maintainability, and performance in the process.
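You can check this mental model directly. The sketch below compares each comprehension form with the matching constructor call on a generator expression; the variable names are mine, chosen only for the demonstration.

```python
# List: constructor call on a generator expression vs. list comprehension.
squares_via_call = list(n**2 for n in range(10))
squares_via_comprehension = [n**2 for n in range(10)]

# Set: same idea with set() and curly braces.
remainders_via_call = set(n % 3 for n in range(10))
remainders_via_comprehension = {n % 3 for n in range(10)}

# Dict: the generator expression yields (key, value) tuples.
table_via_call = dict((n, n**2) for n in range(4))
table_via_comprehension = {n: n**2 for n in range(4)}

print(squares_via_call == squares_via_comprehension)        # True
print(remainders_via_call == remainders_via_comprehension)  # True
print(table_via_call == table_via_comprehension)            # True
```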
If generator expressions are so great, why would you ever use list comprehensions? Generally speaking, your code will be more scalable and responsive if you use a generator expression. Except, of course, when you actually need a list. There are several considerations.
First, if the sequence is unlikely to be very big—and by “big”, I mean a minimum of thousands of elements long—you probably won’t benefit from using a generator expression. That’s just not big enough for the memory footprint to matter.
Next, generator expressions do not always fit the usage pattern you need. If you need random access, or to go through the sequence twice, generator expressions won’t work. Generator expressions also won’t work if you need to append or remove elements, or change the value at some index so that you can look it up later.
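A quick sketch of those limitations in action, using a small made-up sequence of even numbers. It shows a second pass coming up empty, indexing failing, and a list handling both without complaint.

```python
evens = (n for n in range(10) if n % 2 == 0)

first_pass = list(evens)   # consumes the generator
second_pass = list(evens)  # nothing left: the generator is exhausted
print(first_pass)   # [0, 2, 4, 6, 8]
print(second_pass)  # []

# Random access is not supported on a generator object.
fresh = (n for n in range(10) if n % 2 == 0)
try:
    fresh[2]
    indexing_worked = True
except TypeError:
    indexing_worked = False
print(indexing_worked)  # False

# A list supports indexing, repeated passes, and modification.
evens_list = [n for n in range(10) if n % 2 == 0]
third = evens_list[2]
evens_list.append(10)
print(third, evens_list)  # 4 [0, 2, 4, 6, 8, 10]
```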
This is especially important when writing a method or function whose return value is a sequence. Do you return a generator expression, or a list comprehension?
In theory, there’s no reason to ever return a list instead of a
generator object; the caller can turn a generator object into a list
just by passing it to list(). In practice, the interface may be such
that the caller will want an actual list; forcing them to deal with a
generator object will just get in the way. Also, if you are
constructing the return value as a list within the function, it’s
silly to return a generator expression over it—just return the
actual list.
If your intention is to create a library usable by people who may not be advanced Pythonistas, that can be an argument for returning lists. Almost all programmers are familiar with list-like data structures. But fewer are familiar with how generators work in Python, and may—quite reasonably—get confused when confronted with a generator object.
Just like a list comprehension creates a list, a dictionary
comprehension creates a dictionary. You saw an example at the
beginning of this chapter; here’s another. Suppose you have this
Student class:
class Student:
    def __init__(self, name, gpa, major):
        self.name = name
        self.gpa = gpa
        self.major = major
Given a list named students, containing Student instances, we can
write a dictionary comprehension mapping student names to their GPAs:
>>> {student.name: student.gpa for student in students}
{'Jim Smith': 3.6, 'Ryan Spencer': 3.1,
 'Penny Gilmore': 3.9, 'Alisha Jones': 2.5,
 'Todd Reynolds': 3.4}
The syntax differs from that of list comprehensions in two
ways. Instead of square brackets, you’re using curly braces—which
makes sense, since this creates a dictionary. The other difference is
the expression field, whose format is “key: value”, since a dict has
key-value pairs. So the structure is:
{ KEY: VALUE for VARIABLE in SEQUENCE }
These are the only differences. Everything else you learned about
list comprehensions applies, including filtering with if clauses:
>>> def invert_name(name):
...     first, last = name.split(" ", 1)
...     return last + ", " + first
...
>>> # Get "lastname, firstname" of high-GPA students.
... {invert_name(student.name): student.gpa
...  for student in students
...  if student.gpa > 3.5}
{'Smith, Jim': 3.6, 'Gilmore, Penny': 3.9}
You can create sets too. Set comprehensions look exactly like list comprehensions, but with curly braces instead of square brackets:
>>> # A list of student majors...
... [student.major for student in students]
['Computer Science', 'Economics', 'Computer Science',
 'Economics', 'Basket Weaving']
>>> # And the same as a set:
... {student.major for student in students}
{'Economics', 'Computer Science', 'Basket Weaving'}
>>> # You can also use the set() built-in.
... set(student.major for student in students)
{'Economics', 'Computer Science', 'Basket Weaving'}
(How does Python distinguish between a set and dict comprehension?
dict’s expression is a key-value pair, while set’s is
a single value.)
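A tiny sketch of that distinction, using a made-up list of words: the only thing that changes between the two comprehensions is whether the expression is a key: value pair or a single value.

```python
words = ["bib", "bias", "dad", "eye"]

# "key: value" in the expression field makes a dict.
lengths_by_word = {word: len(word) for word in words}

# A single value in the expression field makes a set.
distinct_lengths = {len(word) for word in words}

print(type(lengths_by_word).__name__)   # dict
print(type(distinct_lengths).__name__)  # set
print(distinct_lengths == {3, 4})       # True
```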
What about tuple comprehensions? This is fun: strictly speaking,
Python doesn’t support them. However, you can pretend it does by using
tuple():
>>> tuple(student.gpa for student in students
...       if student.major == "Computer Science")
(3.6, 2.5)
This creates a tuple, but it’s not a tuple comprehension. You’re
calling the tuple constructor, and passing it a single
argument. What’s that argument? A generator expression! In other
words, you’re doing this:
>>> cs_students = (
...     student.gpa for student in students
...     if student.major == "Computer Science"
... )
>>> type(cs_students)
<class 'generator'>
>>> tuple(cs_students)
(3.6, 2.5)
>>>
>>> # Same as:
... tuple((student.gpa for student in students
...        if student.major == "Computer Science"))
(3.6, 2.5)
>>> # But you can omit the inner parentheses.
tuple’s constructor takes an iterable as its argument. Here,
cs_students is a generator object (created by the generator
expression), and a generator object is an iterator, which is itself
iterable. So you can pretend Python has tuple comprehensions, using
“tuple(” as the opener and “)” as the closer. In fact, this also
gives you alternate ways to write dictionary and set comprehensions:
>>> # Same as:
... # { student.name: student.gpa for student in students }
>>> dict((student.name, student.gpa)
...      for student in students)
{'Jim Smith': 3.6, 'Penny Gilmore': 3.9,
 'Alisha Jones': 2.5, 'Ryan Spencer': 3.1,
 'Todd Reynolds': 3.4}
>>> # Same as:
... # { student.major for student in students }
>>> set(student.major for student in students)
{'Computer Science', 'Basket Weaving', 'Economics'}
Remember, when you pass a generator expression into a function, you can omit the inner parentheses. That’s why you can, for example, type
tuple(f(x) for x in numbers)
instead of
tuple((f(x) for x in numbers))
One last point. Generator expressions are a scalable analog of list
comprehensions; is there any such equivalent for dicts, or for sets?
No, but you can still construct generator expressions
and pass the resulting generator object to their constructor, much
like you did with tuple.
For dict, you will want the yielded elements to be (key, value) tuples. For sets, it is maximally efficient to code that generator expression to only yield unique values. But that is not always worth the trouble; if duplicates are generated, the set constructor will handle it fine.
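Here is a minimal sketch of both points, using plain lists in place of Student objects to stay self-contained. The majors and the two name and GPA pairs are borrowed from the examples above; the variable names are my own.

```python
majors = [
    "Computer Science", "Economics", "Computer Science",
    "Economics", "Basket Weaving",
]

# The generator expression yields duplicates; set() discards them.
unique_majors = set(major for major in majors)
print(len(majors), len(unique_majors))  # 5 3

# For dict(), the generator expression yields (key, value) tuples.
name_gpa_pairs = [("Jim Smith", 3.6), ("Penny Gilmore", 3.9)]
gpa_by_name = dict((name, gpa) for name, gpa in name_gpa_pairs)
print(gpa_by_name)  # {'Jim Smith': 3.6, 'Penny Gilmore': 3.9}
```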
Comprehensions are a useful tool for readable, maintainable Python. Their sensible succinctness and high-level, declarative nature make them easy to write, easy to read, and easy to maintain. Use them more in your code, and you will find your Python experience greatly improved.
1 Technically, the condition does not have to depend on the variable. But useful examples of this are extremely rare.
2 I like to think future version control tools will automatically resolve this kind of situation. I believe it will require the tool to have knowledge of the language grammar, so it can parse and reason about different clauses in a line of code.
3 Refresher: The range() built-in returns a range object, which is an iterable sequence of integers (not an iterator), and can be called with 1, 2, or 3 arguments. The most general form is range(start, stop, step), beginning at start, going up to but not including stop, in increments of step. Called with two arguments, the step-size defaults to 1; with one argument, that argument is the stop, and the sequence starts at 0.