Iteration is one of Python’s strongest
features. At a high level, you might simply view iteration as a way to
process items in a sequence. However, there is so much more that is
possible, such as creating your own iterator objects, applying useful iteration
patterns in the itertools module, making generator functions, and so forth.
This chapter aims to address common problems involving iteration.
You need to process items in an iterable, but for whatever reason,
you can’t or don’t want to use a for loop.
To manually consume an iterable, use the next() function and
write your code to catch the StopIteration exception. For example,
the following code manually reads lines from a file:
with open('/etc/passwd') as f:
    try:
        while True:
            line = next(f)
            print(line, end='')
    except StopIteration:
        pass
Normally, StopIteration is used to signal the end of iteration. However,
if you’re using next() manually (as shown), you can also instruct it to
return a terminating value, such as None, instead. For example:
with open('/etc/passwd') as f:
    while True:
        line = next(f, None)
        if line is None:
            break
        print(line, end='')
In most cases, the for statement is used to consume an iterable.
However, every now and then, a problem calls for more precise control
over the underlying iteration mechanism. Thus, it is useful to know
what actually happens.
The following interactive example illustrates the basic mechanics of what happens during iteration:
>>> items = [1, 2, 3]
>>> # Get the iterator
>>> it = iter(items)     # Invokes items.__iter__()
>>> # Run the iterator
>>> next(it)             # Invokes it.__next__()
1
>>> next(it)
2
>>> next(it)
3
>>> next(it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>>
Subsequent recipes in this chapter expand on iteration techniques, and knowledge of the basic iterator protocol is assumed. Be sure to tuck this first recipe away in your memory.
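To make the protocol concrete, here is a sketch of roughly what a for loop does under the hood. The helper name manual_for is made up for illustration; this is a model of the behavior, not CPython's actual implementation:

```python
def manual_for(iterable, action):
    # Roughly equivalent to: for item in iterable: action(item)
    it = iter(iterable)        # invokes iterable.__iter__()
    while True:
        try:
            item = next(it)    # invokes it.__next__()
        except StopIteration:
            break              # end of iteration reached
        action(item)

collected = []
manual_for([1, 2, 3], collected.append)
print(collected)               # [1, 2, 3]
```

Any object usable on the right side of a for statement works the same way with this loop, since both rely only on iter() and next().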
You have built a custom container object that internally holds a list, tuple, or some other iterable. You would like to make iteration work with your new container.
Typically, all you need to do is define an __iter__() method that
delegates iteration to the internally held container. For example:
class Node:
    def __init__(self, value):
        self._value = value
        self._children = []

    def __repr__(self):
        return 'Node({!r})'.format(self._value)

    def add_child(self, node):
        self._children.append(node)

    def __iter__(self):
        return iter(self._children)

# Example
if __name__ == '__main__':
    root = Node(0)
    child1 = Node(1)
    child2 = Node(2)
    root.add_child(child1)
    root.add_child(child2)
    for ch in root:
        print(ch)
    # Outputs Node(1), Node(2)
In this code, the __iter__() method simply forwards the iteration
request to the internally held _children attribute.
Python’s iterator protocol requires __iter__() to return a
special iterator object that implements a __next__() method to
carry out the actual iteration. If all you are doing is iterating
over the contents of another container, you don’t really
need to worry about the underlying details of how it works.
All you need to do is to forward the iteration request along.
The use of the iter() function here is a bit of
a shortcut that cleans up the code. iter(s) simply returns
the underlying iterator by calling s.__iter__(), much in the
same way that len(s) invokes s.__len__().
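If you want to see that equivalence for yourself, the following sketch compares the two calls directly:

```python
s = [10, 20, 30]

it1 = iter(s)            # the usual way
it2 = s.__iter__()       # what iter(s) does behind the scenes

# Both calls hand back the same kind of independent iterator object
print(type(it1) is type(it2))    # True
print(next(it1), next(it2))      # 10 10
```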
You want to implement a custom iteration pattern that's
different from the patterns provided by the usual built-in functions (e.g., range(),
reversed(), etc.).
If you want to implement a new kind of iteration pattern, define it using a generator function. Here’s a generator that produces a range of floating-point numbers:
def frange(start, stop, increment):
    x = start
    while x < stop:
        yield x
        x += increment
To use such a function, you iterate over it using a for loop or
use it with some other function that consumes an iterable (e.g., sum(),
list(), etc.). For example:
>>> for n in frange(0, 4, 0.5):
...     print(n)
...
0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
>>> list(frange(0, 1, 0.125))
[0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875]
>>>
The mere presence of the yield statement in a function turns
it into a generator. Unlike a normal function, a generator
only runs in response to iteration. Here’s an experiment you
can try to see the underlying mechanics of how such a function works:
>>> def countdown(n):
...     print('Starting to count from', n)
...     while n > 0:
...         yield n
...         n -= 1
...     print('Done!')
...
>>> # Create the generator, notice no output appears
>>> c = countdown(3)
>>> c
<generator object countdown at 0x1006a0af0>
>>> # Run to first yield and emit a value
>>> next(c)
Starting to count from 3
3
>>> # Run to the next yield
>>> next(c)
2
>>> # Run to next yield
>>> next(c)
1
>>> # Run to next yield (iteration stops)
>>> next(c)
Done!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>>
The key feature is that a generator function only runs in response to
“next” operations carried out in iteration. Once a generator function
returns, iteration stops. However, the for statement that’s usually used to
iterate takes care of these details, so you don’t normally need to worry about them.
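One detail worth seeing is that returning from a generator (with an explicit return or by falling off the end of the function body) is what raises StopIteration in the consumer. The function name limited below is made up for this sketch:

```python
def limited(n):
    yield n
    if n > 0:
        return      # ends iteration; the next call to next() raises StopIteration
    yield -n        # only reached when n <= 0

g = limited(5)
print(next(g))      # 5
try:
    next(g)
except StopIteration:
    print('iteration stopped')
```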
You are building custom objects on which you would like to support iteration, but would like an easy way to implement the iterator protocol.
By far, the easiest way to implement iteration on an object is to
use a generator function. In Recipe 4.2, a Node class
was presented for representing tree structures. Perhaps you want to implement an iterator that traverses nodes in a depth-first pattern.
Here is how you could do it:
class Node:
    def __init__(self, value):
        self._value = value
        self._children = []

    def __repr__(self):
        return 'Node({!r})'.format(self._value)

    def add_child(self, node):
        self._children.append(node)

    def __iter__(self):
        return iter(self._children)

    def depth_first(self):
        yield self
        for c in self:
            yield from c.depth_first()

# Example
if __name__ == '__main__':
    root = Node(0)
    child1 = Node(1)
    child2 = Node(2)
    root.add_child(child1)
    root.add_child(child2)
    child1.add_child(Node(3))
    child1.add_child(Node(4))
    child2.add_child(Node(5))

    for ch in root.depth_first():
        print(ch)
    # Outputs Node(0), Node(1), Node(3), Node(4), Node(2), Node(5)
In this code, the depth_first() method is simple to read and
describe. It first yields itself and then iterates over each child
yielding the items produced by the child’s depth_first() method
(using yield from).
Python’s iterator protocol requires __iter__() to return a
special iterator object that implements a __next__() operation and uses a
StopIteration exception to signal completion. However, implementing
such objects can often be a messy affair. For example, the following
code shows an alternative implementation of the depth_first() method
using an associated iterator class:
class Node:
    def __init__(self, value):
        self._value = value
        self._children = []

    def __repr__(self):
        return 'Node({!r})'.format(self._value)

    def add_child(self, other_node):
        self._children.append(other_node)

    def __iter__(self):
        return iter(self._children)

    def depth_first(self):
        return DepthFirstIterator(self)


class DepthFirstIterator(object):
    '''
    Depth-first traversal
    '''
    def __init__(self, start_node):
        self._node = start_node
        self._children_iter = None
        self._child_iter = None

    def __iter__(self):
        return self

    def __next__(self):
        # Return myself if just started; create an iterator for children
        if self._children_iter is None:
            self._children_iter = iter(self._node)
            return self._node
        # If processing a child, return its next item
        elif self._child_iter:
            try:
                nextchild = next(self._child_iter)
                return nextchild
            except StopIteration:
                self._child_iter = None
                return next(self)
        # Advance to the next child and start its iteration
        else:
            self._child_iter = next(self._children_iter).depth_first()
            return next(self)
The DepthFirstIterator class works in the same way as the generator
version, but it’s a mess because the iterator has to maintain
a lot of complex state about where it is in the iteration process.
Frankly, nobody likes to write mind-bending code like that.
Define your iterator as a generator and be done with it.
Use the built-in reversed() function. For example:
>>> a = [1, 2, 3, 4]
>>> for x in reversed(a):
...     print(x)
...
4
3
2
1
Reversed iteration only works if the object in question has a size that can
be determined or if the object implements a __reversed__() special method.
If neither of these can be satisfied, you’ll have to convert the object
into a list first. For example:
# Print a file backwards
f = open('somefile')
for line in reversed(list(f)):
    print(line, end='')
Be aware that turning an iterable into a list as shown could consume a lot of memory if it’s large.
Many programmers don’t realize that reversed iteration can be customized
on user-defined classes if they implement the __reversed__() method.
For example:
class Countdown:
    def __init__(self, start):
        self.start = start

    # Forward iterator
    def __iter__(self):
        n = self.start
        while n > 0:
            yield n
            n -= 1

    # Reverse iterator
    def __reversed__(self):
        n = 1
        while n <= self.start:
            yield n
            n += 1
Defining a reversed iterator makes the code much more efficient, as it’s no longer necessary to pull the data into a list and iterate in reverse on the list.
You would like to define a generator function, but it involves extra state that you would like to expose to the user somehow.
If you want a generator to expose extra state to the user,
don’t forget that you can easily implement it as a class, putting
the generator function code in the __iter__() method. For
example:
from collections import deque

class linehistory:
    def __init__(self, lines, histlen=3):
        self.lines = lines
        self.history = deque(maxlen=histlen)

    def __iter__(self):
        for lineno, line in enumerate(self.lines, 1):
            self.history.append((lineno, line))
            yield line

    def clear(self):
        self.history.clear()
To use this class, you would treat it like a normal generator
function. However, since it creates an instance, you can
access internal attributes, such as the history attribute or
the clear() method. For example:
with open('somefile.txt') as f:
    lines = linehistory(f)
    for line in lines:
        if 'python' in line:
            for lineno, hline in lines.history:
                print('{}:{}'.format(lineno, hline), end='')
With generators, it is easy to fall into a trap of trying to do
everything with functions alone. This can lead to rather complicated
code if the generator function needs to interact with other parts of
your program in unusual ways (exposing attributes, allowing control
via method calls, etc.). If this is the case, just use a class
definition, as shown. Defining your generator in the __iter__()
method doesn’t change anything about how you write your algorithm.
The fact that it’s part of a class makes it easy for you to provide
attributes and methods for users to interact with.
One potential subtlety with the method shown is that it might require
an extra step of calling iter() if you are going to drive iteration
using a technique other than a for loop. For example:
>>> f = open('somefile.txt')
>>> lines = linehistory(f)
>>> next(lines)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'linehistory' object is not an iterator

>>> # Call iter() first, then start iterating
>>> it = iter(lines)
>>> next(it)
'hello world\n'
>>> next(it)
'this is a test\n'
>>>
You want to take a slice of data produced by an iterator, but the normal slicing operator doesn’t work.
The itertools.islice() function is perfectly suited
for taking slices of iterators and generators. For example:
>>> def count(n):
...     while True:
...         yield n
...         n += 1
...
>>> c = count(0)
>>> c[10:20]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'generator' object is not subscriptable

>>> # Now using islice()
>>> import itertools
>>> for x in itertools.islice(c, 10, 20):
...     print(x)
...
10
11
12
13
14
15
16
17
18
19
>>>
Iterators and generators can’t normally be sliced, because no
information is known about their length (and they don’t implement
indexing). The result of islice() is an iterator that produces the
desired slice items, but it does this by consuming and discarding all
of the items up to the starting slice index. Further items are then
produced by the islice object until the ending index has been
reached.
It’s important to emphasize that islice() will consume data
on the supplied iterator. Since iterators can’t be rewound,
that is something to consider. If it’s important to go back, you
should probably just turn the data into a list first.
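The consumption is easy to demonstrate. Here is a self-contained sketch (count is redefined locally so the example stands on its own):

```python
import itertools

def count(n):
    # Infinite counter, as in the example above
    while True:
        yield n
        n += 1

c = count(0)
print(list(itertools.islice(c, 3, 6)))   # [3, 4, 5] -- items 0, 1, 2 were discarded
print(next(c))                           # 6 -- the iterator has advanced past the slice
```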
You want to iterate over items in an iterable, but the first few items aren’t of interest and you just want to discard them.
The itertools module has a few functions that can be used to
address this task. The first is the itertools.dropwhile() function.
To use it, you supply a function and an iterable. The returned
iterator discards the first items in the sequence as long as the
supplied function returns True. Afterward, the entirety of the
sequence is produced.
To illustrate, suppose you are reading a file that starts with a series of comment lines. For example:
>>> with open('/etc/passwd') as f:
...     for line in f:
...         print(line, end='')
...
##
# User Database
#
# Note that this file is consulted directly only when the system is running
# in single-user mode.  At other times, this information is provided by
# Open Directory.
...
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
...
>>>
If you want to skip all of the initial comment lines, here’s one way to do it:
>>> from itertools import dropwhile
>>> with open('/etc/passwd') as f:
...     for line in dropwhile(lambda line: line.startswith('#'), f):
...         print(line, end='')
...
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
...
>>>
This example is based on skipping the first items according to a test
function. If you happen to know the exact number of items you want to
skip, then you can use itertools.islice() instead. For example:
>>> from itertools import islice
>>> items = ['a', 'b', 'c', 1, 4, 10, 15]
>>> for x in islice(items, 3, None):
...     print(x)
...
1
4
10
15
>>>
In this example, the last None argument to islice() is required
to indicate that you want everything beyond the first three items
as opposed to only the first three items (e.g., a slice of [3:] as
opposed to a slice of [:3]).
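For contrast, here is a sketch of the two forms side by side; with only two arguments, the second value is a stop index, not a start:

```python
from itertools import islice

items = ['a', 'b', 'c', 1, 4, 10, 15]

print(list(islice(items, 3, None)))   # [1, 4, 10, 15]  (like items[3:])
print(list(islice(items, 3)))         # ['a', 'b', 'c'] (like items[:3])
```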
The dropwhile() and islice() functions are mainly convenience functions
that you can use to avoid writing rather messy code such as this:
with open('/etc/passwd') as f:
    # Skip over initial comments
    while True:
        line = next(f, '')
        if not line.startswith('#'):
            break

    # Process remaining lines
    while line:
        # Replace with useful processing
        print(line, end='')
        line = next(f, None)
Discarding the first part of an iterable is also slightly different than simply filtering all of it. For example, the first part of this recipe might be rewritten as follows:
with open('/etc/passwd') as f:
    lines = (line for line in f if not line.startswith('#'))
    for line in lines:
        print(line, end='')
This will obviously discard the comment lines at the start, but will also discard all such lines throughout the entire file. On the other hand, the solution only discards items until an item no longer satisfies the supplied test. After that, all subsequent items are returned with no filtering.
Last, but not least, it should be emphasized that this recipe works with all iterables, including those whose size can’t be determined in advance. This includes generators, files, and similar kinds of objects.
You want to iterate over all of the possible combinations or permutations of a collection of items.
The itertools module provides three functions for this task.
The first of these—itertools.permutations()—takes a collection of items and produces
a sequence of tuples that rearranges all of the items into all
possible permutations (i.e., it shuffles them into all possible
configurations). For example:
>>> items = ['a', 'b', 'c']
>>> from itertools import permutations
>>> for p in permutations(items):
...     print(p)
...
('a', 'b', 'c')
('a', 'c', 'b')
('b', 'a', 'c')
('b', 'c', 'a')
('c', 'a', 'b')
('c', 'b', 'a')
>>>
If you want all permutations of a smaller length, you can give an optional length argument. For example:
>>> for p in permutations(items, 2):
...     print(p)
...
('a', 'b')
('a', 'c')
('b', 'a')
('b', 'c')
('c', 'a')
('c', 'b')
>>>
Use itertools.combinations() to produce a sequence of combinations
of items taken from the input. For example:
>>> from itertools import combinations
>>> for c in combinations(items, 3):
...     print(c)
...
('a', 'b', 'c')

>>> for c in combinations(items, 2):
...     print(c)
...
('a', 'b')
('a', 'c')
('b', 'c')

>>> for c in combinations(items, 1):
...     print(c)
...
('a',)
('b',)
('c',)
>>>
For combinations(), the actual order of the elements is not considered.
That is, the combination ('a', 'b') is considered to be the same as ('b',
'a') (which is not produced).
When producing combinations, chosen items are removed from the
collection of possible candidates (i.e., if 'a' has already been
chosen, then it is removed from consideration). The
itertools.combinations_with_replacement() function relaxes this, and
allows the same item to be chosen more than once. For example:
>>> from itertools import combinations_with_replacement
>>> for c in combinations_with_replacement(items, 3):
...     print(c)
...
('a', 'a', 'a')
('a', 'a', 'b')
('a', 'a', 'c')
('a', 'b', 'b')
('a', 'b', 'c')
('a', 'c', 'c')
('b', 'b', 'b')
('b', 'b', 'c')
('b', 'c', 'c')
('c', 'c', 'c')
>>>
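As a sanity check on the number of results, the math module's perm() and comb() functions (available in Python 3.8 and later) compute the expected counts directly:

```python
import math
from itertools import permutations, combinations

items = ['a', 'b', 'c']

# Length-2 permutations: n!/(n-r)! = 3 * 2 = 6 of them
print(len(list(permutations(items, 2))))   # 6
print(math.perm(3, 2))                     # 6

# Length-2 combinations: n!/(r!(n-r)!) = 3 of them
print(len(list(combinations(items, 2))))   # 3
print(math.comb(3, 2))                     # 3
```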
This recipe demonstrates only some of the power found in the
itertools module. Although you could certainly write code to
produce permutations and combinations yourself, doing so would
probably require more than a fair bit of thought. When faced with
seemingly complicated iteration problems, it always pays to look
at itertools first. If the problem is common, chances are a solution
is already available.
You want to iterate over a sequence, but would like to keep track of which element of the sequence is currently being processed.
The built-in enumerate() function handles this quite nicely:
>>> my_list = ['a', 'b', 'c']
>>> for idx, val in enumerate(my_list):
...     print(idx, val)
...
0 a
1 b
2 c
For printing output with canonical line numbers (where you typically start
the numbering at 1 instead of 0), you can pass in a start argument:
>>> my_list = ['a', 'b', 'c']
>>> for idx, val in enumerate(my_list, 1):
...     print(idx, val)
...
1 a
2 b
3 c
This is especially useful for tracking line numbers in files, should you want to use a line number in an error message:
def parse_data(filename):
    with open(filename, 'rt') as f:
        for lineno, line in enumerate(f, 1):
            fields = line.split()
            try:
                count = int(fields[1])
                ...
            except ValueError as e:
                print('Line {}: Parse error: {}'.format(lineno, e))
enumerate() can be handy for keeping track of the offset into a list for
occurrences of certain values, for example. So, if you want to map words in a
file to the lines in which they occur, it can easily be accomplished
using enumerate() to map each word to the line offset in the file where it
was found:
from collections import defaultdict

word_summary = defaultdict(list)

with open('myfile.txt', 'r') as f:
    lines = f.readlines()

for idx, line in enumerate(lines):
    # Create a list of words in current line
    words = [w.strip().lower() for w in line.split()]
    for word in words:
        word_summary[word].append(idx)
If you print word_summary after processing the file, it’ll be a dictionary
(a defaultdict to be precise), and it’ll have a key for each word. The value
for each word-key will be a list of line numbers that word occurred on. If
the word occurred twice on a single line, that line number will be listed
twice, making it possible to identify various simple metrics about the text.
enumerate() is a nice shortcut for situations where you might be
inclined to keep your own counter variable. You could write code like this:
lineno = 1
for line in f:
    # Process line
    ...
    lineno += 1
But it’s usually much more elegant (and less error prone) to use enumerate()
instead:
for lineno, line in enumerate(f):
    # Process line
    ...
The value returned by enumerate() is an instance of an enumerate object,
which is an iterator that returns successive tuples consisting of a
counter and the value returned by calling next() on the sequence
you’ve passed in.
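If it helps to see the mechanics, enumerate() can be modeled as a simple generator. The name my_enumerate is made up; this is a rough model, not the built-in's actual implementation:

```python
def my_enumerate(sequence, start=0):
    # Rough model of the built-in enumerate()
    n = start
    for item in sequence:
        yield n, item
        n += 1

print(list(my_enumerate(['a', 'b', 'c'], 1)))   # [(1, 'a'), (2, 'b'), (3, 'c')]
```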
Although a minor point, it’s worth mentioning that sometimes it is
easy to get tripped up when applying enumerate() to a sequence
of tuples that are also being unpacked. To do it, you have to
write code like this:
data = [(1, 2), (3, 4), (5, 6), (7, 8)]

# Correct!
for n, (x, y) in enumerate(data):
    ...

# Error!
for n, x, y in enumerate(data):
    ...
To iterate over more than one sequence simultaneously, use the zip()
function. For example:
>>> xpts = [1, 5, 4, 2, 10, 7]
>>> ypts = [101, 78, 37, 15, 62, 99]
>>> for x, y in zip(xpts, ypts):
...     print(x, y)
...
1 101
5 78
4 37
2 15
10 62
7 99
>>>
zip(a, b) works by creating an iterator that produces tuples (x,
y) where x is taken from a and y is taken from b. Iteration
stops whenever one of the input sequences is exhausted. Thus, the
length of the iteration is the same as the length of the shortest input.
For example:
>>> a = [1, 2, 3]
>>> b = ['w', 'x', 'y', 'z']
>>> for i in zip(a, b):
...     print(i)
...
(1, 'w')
(2, 'x')
(3, 'y')
>>>
If this behavior is not desired, use itertools.zip_longest() instead.
For example:
>>> from itertools import zip_longest
>>> for i in zip_longest(a, b):
...     print(i)
...
(1, 'w')
(2, 'x')
(3, 'y')
(None, 'z')

>>> for i in zip_longest(a, b, fillvalue=0):
...     print(i)
...
(1, 'w')
(2, 'x')
(3, 'y')
(0, 'z')
>>>
zip() is commonly used whenever you need to pair data together. For
example, suppose you have a list of column headers and column values
like this:
headers = ['name', 'shares', 'price']
values = ['ACME', 100, 490.1]
Using zip(), you can pair the values together to make a dictionary like
this:
s = dict(zip(headers, values))
Alternatively, if you are trying to produce output, you can write code like this:
for name, val in zip(headers, values):
    print(name, '=', val)
It’s less common, but zip() can be passed more than two sequences as input.
For this case, the resulting tuples have the same number of items in them
as the number of input sequences. For example:
>>> a = [1, 2, 3]
>>> b = [10, 11, 12]
>>> c = ['x', 'y', 'z']
>>> for i in zip(a, b, c):
...     print(i)
...
(1, 10, 'x')
(2, 11, 'y')
(3, 12, 'z')
>>>
Last, but not least, it’s important to emphasize that zip() creates
an iterator as a result. If you need the paired values stored in a
list, use the list() function. For example:
>>> zip(a, b)
<zip object at 0x1007001b8>
>>> list(zip(a, b))
[(1, 10), (2, 11), (3, 12)]
>>>
You need to perform the same operation on many objects, but the objects are contained in different containers, and you’d like to avoid nested loops without losing the readability of your code.
The itertools.chain() function can be used to simplify this task. It
takes a list of iterables as input, and returns an iterator that
effectively masks the fact that you’re really acting on multiple
containers. To illustrate, consider this example:
>>> from itertools import chain
>>> a = [1, 2, 3, 4]
>>> b = ['x', 'y', 'z']
>>> for x in chain(a, b):
...     print(x)
...
1
2
3
4
x
y
z
>>>
A common use of chain() is in programs where you would like to perform
certain operations on all of the items at once but the items are pooled into different working sets. For example:
# Various working sets of items
active_items = set()
inactive_items = set()

# Iterate over all items
for item in chain(active_items, inactive_items):
    # Process item
    ...
This solution is much more elegant than using two separate loops, as in the following:
for item in active_items:
    # Process item
    ...

for item in inactive_items:
    # Process item
    ...
itertools.chain() accepts one or more iterables as arguments. It
works by creating an iterator that successively consumes and
returns the items produced by each of the supplied iterables.
It's a subtle distinction, but chain() is more
efficient than first combining the sequences and then iterating. For example:
# Inefficient
for x in a + b:
    ...

# Better
for x in chain(a, b):
    ...
In the first case, the operation a + b creates an entirely new sequence
and additionally requires a and b to be of the same type.
chain() performs no such operation, so it’s far more efficient with
memory if the input sequences are large and it can be easily applied when
the iterables in question are of different types.
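A closely related tool is itertools.chain.from_iterable(), which takes a single iterable of iterables rather than separate arguments. A quick sketch:

```python
from itertools import chain

groups = [[1, 2, 3], ['x', 'y'], [4.5]]

# Same items as chain(*groups), but the nested iterable is consumed lazily
print(list(chain.from_iterable(groups)))   # [1, 2, 3, 'x', 'y', 4.5]
```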
You want to process data iteratively in the style of a data processing pipeline (similar to Unix pipes). For instance, you have a huge amount of data that needs to be processed, but it can’t fit entirely into memory.
Generator functions are a good way to implement processing pipelines. To illustrate, suppose you have a huge directory of log files that you want to process:
foo/
access-log-012007.gz
access-log-022007.gz
access-log-032007.gz
...
access-log-012008
bar/
access-log-092007.bz2
...
    access-log-022008

Suppose each file contains lines of data like this:
124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robots.txt ..." 200 71
210.212.209.67 - - [10/Jul/2012:00:18:51 -0500] "GET /ply/ ..." 200 11875
210.212.209.67 - - [10/Jul/2012:00:18:51 -0500] "GET /favicon.ico ..." 404 369
61.135.216.105 - - [10/Jul/2012:00:20:04 -0500] "GET /blog/atom.xml ..." 304 -
...

To process these files, you could define a collection of small generator functions that perform specific self-contained tasks. For example:
import os
import fnmatch
import gzip
import bz2
import re

def gen_find(filepat, top):
    '''
    Find all filenames in a directory tree that match a shell wildcard pattern
    '''
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

def gen_opener(filenames):
    '''
    Open a sequence of filenames one at a time producing a file object.
    The file is closed immediately when proceeding to the next iteration.
    '''
    for filename in filenames:
        if filename.endswith('.gz'):
            f = gzip.open(filename, 'rt')
        elif filename.endswith('.bz2'):
            f = bz2.open(filename, 'rt')
        else:
            f = open(filename, 'rt')
        yield f
        f.close()

def gen_concatenate(iterators):
    '''
    Chain a sequence of iterators together into a single sequence.
    '''
    for it in iterators:
        yield from it

def gen_grep(pattern, lines):
    '''
    Look for a regex pattern in a sequence of lines
    '''
    pat = re.compile(pattern)
    for line in lines:
        if pat.search(line):
            yield line
You can now easily stack these functions together to make a processing pipeline. For example, to find all log lines that contain the word python, you would just do this:
lognames = gen_find('access-log*', 'www')
files = gen_opener(lognames)
lines = gen_concatenate(files)
pylines = gen_grep('(?i)python', lines)
for line in pylines:
    print(line)
If you want to extend the pipeline further, you can even feed the data in generator expressions. For example, this version finds the number of bytes transferred and sums the total:
lognames = gen_find('access-log*', 'www')
files = gen_opener(lognames)
lines = gen_concatenate(files)
pylines = gen_grep('(?i)python', lines)
bytecolumn = (line.rsplit(None, 1)[1] for line in pylines)
bytes = (int(x) for x in bytecolumn if x != '-')
print('Total', sum(bytes))
Processing data in a pipelined manner works well for a wide variety of other problems, including parsing, reading from real-time data sources, periodic polling, and so on.
In understanding the code, it is important to grasp that
the yield statement acts as a kind of data producer whereas a
for loop acts as a data consumer. When the generators are
stacked together, each yield feeds a single item of data
to the next stage of the pipeline that is consuming it with iteration.
In the last example, the sum() function is actually driving the
entire program, pulling one item at a time out of the pipeline
of generators.
One nice feature of this approach is that each generator function tends to be small and self-contained. As such, they are easy to write and maintain. In many cases, they are so general purpose that they can be reused in other contexts. The resulting code that glues the components together also tends to read like a simple recipe that is easily understood.
The memory efficiency of this approach also cannot be overstated. The code shown would still work even if used on a massive directory of files. In fact, due to the iterative nature of the processing, very little memory would be used at all.
There is a bit of extreme subtlety involving the gen_concatenate()
function. The purpose of this function is to concatenate input sequences
together into one long sequence of lines. The itertools.chain()
function performs a similar function, but requires that all of the
chained iterables be specified as arguments. In the case of this
particular recipe, doing that would involve a statement such as lines
= itertools.chain(*files), which would cause the gen_opener()
generator to be fully consumed. Since that generator is producing a
sequence of open files that are immediately closed in the next
iteration step, chain() can’t be used. The solution shown avoids
this issue.
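The eager unpacking can be demonstrated with a contrived sketch (fake_opener is a made-up stand-in for gen_opener(); in the real recipe the prematurely produced items would be already-closed files). Note that itertools.chain.from_iterable(), which consumes its argument lazily, would avoid the problem as well:

```python
from itertools import chain

opened = []

def fake_opener():
    # Stand-in for gen_opener(): record when each "file" is produced
    for name in ['a', 'b', 'c']:
        opened.append(name)
        yield iter([name + '1', name + '2'])

files = fake_opener()
lines = chain(*files)          # the * forces fake_opener() to run to completion
print(opened)                  # ['a', 'b', 'c'] -- all "files" produced up front

opened.clear()
files = fake_opener()
lines = chain.from_iterable(files)   # lazy: nothing has been produced yet
print(opened)                        # []
print(next(lines))                   # 'a1'
print(opened)                        # ['a'] -- produced on demand
```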
Also appearing in the gen_concatenate() function is the use of yield from
to delegate to a subgenerator. The statement yield from it simply makes
gen_concatenate() emit all of the values produced by the generator it.
This is described further in Recipe 4.14.
Last, but not least, it should be noted that a pipelined approach doesn’t always work for every data handling problem. Sometimes you just need to work with all of the data at once. However, even in that case, using generator pipelines can be a way to logically break a problem down into a kind of workflow.
David Beazley has written extensively about these techniques in his “Generator Tricks for Systems Programmers” tutorial presentation. Consult that for even more examples.
You have a nested sequence that you want to flatten into a single sequence of values.
This is easily solved by writing a recursive generator function
involving a yield from statement. For example:
from collections.abc import Iterable

def flatten(items, ignore_types=(str, bytes)):
    for x in items:
        if isinstance(x, Iterable) and not isinstance(x, ignore_types):
            yield from flatten(x, ignore_types)
        else:
            yield x

items = [1, 2, [3, 4, [5, 6], 7], 8]

# Produces 1 2 3 4 5 6 7 8
for x in flatten(items):
    print(x)
In the code, the isinstance(x, Iterable) check simply tests whether
an item is iterable. If so, yield from is used to emit
all of its values as a kind of subroutine. The end result is a
single sequence of output with no nesting.
The extra argument ignore_types and the check for not isinstance(x,
ignore_types) is there to prevent strings and bytes from being
interpreted as iterables and expanded as individual characters. This
allows nested lists of strings to work in the way that most people
would expect. For example:
>>> items = ['Dave', 'Paula', ['Thomas', 'Lewis']]
>>> for x in flatten(items):
...     print(x)
...
Dave
Paula
Thomas
Lewis
>>>
The yield from statement is a nice shortcut to use if you ever
want to write generators that call other generators as subroutines.
If you don’t use it, you need to write code that uses an extra
for loop. For example:
def flatten(items, ignore_types=(str, bytes)):
    for x in items:
        if isinstance(x, Iterable) and not isinstance(x, ignore_types):
            for i in flatten(x):
                yield i
        else:
            yield x
Although it’s only a minor change, the yield from statement just feels better
and leads to cleaner code.
As noted, the extra check for strings and bytes is there to prevent the
expansion of those types into individual characters. If there are
other types that you don’t want expanded, you can supply a different
value for the ignore_types argument.
Finally, it should be noted that yield from has a more important
role in advanced programs involving coroutines and generator-based
concurrency. See Recipe 12.12 for another example.
You have a collection of sorted sequences and you want to iterate over a sorted sequence of them all merged together.
The heapq.merge() function does exactly what you want.
For example:
>>> import heapq
>>> a = [1, 4, 7, 10]
>>> b = [2, 5, 6, 11]
>>> for c in heapq.merge(a, b):
...     print(c)
...
1
2
4
5
6
7
10
11
The iterative nature of heapq.merge() means that it never reads
any of the supplied sequences all at once. This means that you
can use it on very long sequences with very little overhead.
For instance, here is an example of how you would merge two sorted files:
import heapq

with open('sorted_file_1', 'rt') as file1, \
     open('sorted_file_2', 'rt') as file2, \
     open('merged_file', 'wt') as outf:

    for line in heapq.merge(file1, file2):
        outf.write(line)
It’s important to emphasize that heapq.merge() requires that all of
the input sequences already be sorted. In particular, it does not
first read all of the data into a heap or do any preliminary sorting.
Nor does it perform any kind of validation of the inputs to check if
they meet the ordering requirements. Instead, it simply examines the
set of items from the front of each input sequence and emits
the smallest one found. A new item from the chosen sequence is then
read, and the process repeats itself until all input sequences have
been fully consumed.
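In Python 3.5 and later, merge() also accepts key and reverse arguments, mirroring sorted(). A small sketch, merging on the first element of each tuple:

```python
import heapq

a = [(1, 'a'), (4, 'd')]
b = [(2, 'b'), (3, 'c')]

# Inputs must already be sorted by the same key
merged = list(heapq.merge(a, b, key=lambda t: t[0]))
print(merged)   # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
```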
You have code that uses a while loop to iteratively process
data because it involves a function or some kind of unusual
test condition that doesn’t fall into the usual iteration pattern.
A somewhat common scenario in programs involving I/O is to write code like this:
CHUNKSIZE = 8192

def reader(s):
    while True:
        data = s.recv(CHUNKSIZE)
        if data == b'':
            break
        process_data(data)
Such code can often be replaced using iter(), as follows:
def reader(s):
    for chunk in iter(lambda: s.recv(CHUNKSIZE), b''):
        process_data(chunk)
If you’re a bit skeptical that it might work, you can try a similar example involving files. For example:
>>> import sys
>>> f = open('/etc/passwd')
>>> for chunk in iter(lambda: f.read(10), ''):
...     n = sys.stdout.write(chunk)
...
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1:System Services:/var/root:/usr/bin/false
_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico
...
>>>
A little-known feature of the built-in iter() function is that
it optionally accepts a zero-argument callable and sentinel (terminating)
value as inputs. When used in this way, it creates an iterator that
repeatedly calls the supplied callable over and over again until it
returns the value given as a sentinel.
This particular approach works well with certain kinds of repeatedly
called functions, such as those involving I/O. For example, if you want
to read data in chunks from sockets or files, you usually have to repeatedly
execute read() or recv() calls followed by an end-of-file test. This
recipe simply takes these two features and combines them together into
a single iter() call. The use of lambda in the solution is needed
to create a callable that takes no arguments, yet still supplies the
desired size argument to recv() or read().
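Conceptually, the two-argument form of iter() behaves like the following generator. The name iter_with_sentinel is made up for this sketch; it is a model of the behavior, not the actual implementation:

```python
def iter_with_sentinel(func, sentinel):
    # Model of iter(callable, sentinel): call func() until the sentinel appears
    while True:
        value = func()
        if value == sentinel:
            break
        yield value

# Example: pull chunks from a list until an empty bytes object appears
chunks = [b'abc', b'def', b'']
print(list(iter_with_sentinel(lambda: chunks.pop(0), b'')))   # [b'abc', b'def']
```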