All programs need to perform input and output. This chapter covers common idioms for working with different kinds of files, including text and binary files, file encodings, and other related matters. Techniques for manipulating filenames and directories are also covered.
You need to read or write text data, possibly in different text encodings such as ASCII, UTF-8, or UTF-16.
Use the open() function with mode rt to read a text file. For example:
# Read the entire file as a single string
with open('somefile.txt', 'rt') as f:
    data = f.read()

# Iterate over the lines of the file
with open('somefile.txt', 'rt') as f:
    for line in f:
        # process line
        ...
Similarly, to write a text file, use open() with mode wt to
write a file, clearing and overwriting the previous contents (if any). For example:
# Write chunks of text data
with open('somefile.txt', 'wt') as f:
    f.write(text1)
    f.write(text2)
    ...

# Redirected print statement
with open('somefile.txt', 'wt') as f:
    print(line1, file=f)
    print(line2, file=f)
    ...
To append to the end of an existing file, use open() with mode
at.
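By default, files are read/written using the system default text
encoding, as reported by locale.getpreferredencoding() (note that
sys.getdefaultencoding() is unrelated to open() and always returns
utf-8 in Python 3). On many machines, this is set to utf-8, although
Windows systems commonly use a legacy code page. If you know that the
text you are reading or writing is in a different encoding, supply the
optional encoding parameter to open(). For example: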
with open('somefile.txt', 'rt', encoding='latin-1') as f:
    ...
Python understands several hundred possible text encodings. However,
some of the more common encodings are ascii, latin-1,
utf-8, and utf-16. UTF-8 is usually a safe bet if working
with web applications. ascii corresponds to the 7-bit characters
in the range U+0000 to U+007F. latin-1 is a direct mapping of
bytes 0-255 to Unicode characters U+0000 to U+00FF. latin-1
encoding is notable in that it will never produce a decoding error
when reading text of a possibly unknown encoding. Reading a file as
latin-1 might not produce a completely correct text decoding, but it
still might be enough to extract useful data out of it.
Also, if you later write the data back out, the original input data will be
preserved.
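The round-trip property of latin-1 described above can be checked with a short sketch. The byte values used here are arbitrary sample data chosen only for illustration; they deliberately include bytes that are not valid UTF-8.

```python
# Every possible byte value, including ones that are invalid as UTF-8
raw = bytes(range(256))

# latin-1 maps each byte 0-255 to the code point with the same number,
# so decoding cannot fail
text = raw.decode('latin-1')

# Encoding the text back out reproduces the original bytes exactly
roundtrip = text.encode('latin-1')
print(roundtrip == raw)
```

The decoded text may still look wrong on screen if the data was really in some other encoding; the point is only that no information is lost.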
Reading and writing text files is typically very straightforward.
However, there are a number of subtle aspects to keep in mind. First,
the use of the with statement in the examples establishes a context
in which the file will be used. When control leaves the with block,
the file will be closed automatically. You don’t need to use the with statement,
but if you don’t use it, make sure you remember to close the file:
f = open('somefile.txt', 'rt')
data = f.read()
f.close()
Another minor complication concerns the recognition of newlines, which are
different on Unix and Windows (i.e., \n versus \r\n). By default,
Python operates in what’s known as “universal newline” mode. In this mode,
all common newline conventions are recognized, and newline characters
are converted to a single \n character while reading. Similarly, the
newline character \n is converted to the system default newline character
on output. If you don’t want this translation, supply the newline='' argument
to open(), like this:
# Read with disabled newline translation
with open('somefile.txt', 'rt', newline='') as f:
    ...
To illustrate the difference, here’s what you will see on a Unix machine
if you read the contents of a Windows-encoded text file containing
the raw data hello world!\r\n:
>>> # Newline translation enabled (the default)
>>> f = open('hello.txt', 'rt')
>>> f.read()
'hello world!\n'

>>> # Newline translation disabled
>>> g = open('hello.txt', 'rt', newline='')
>>> g.read()
'hello world!\r\n'
>>>
A final issue concerns possible encoding errors in text files. When reading or writing a text file, you might encounter an encoding or decoding error. For instance:
>>> f = open('sample.txt', 'rt', encoding='ascii')
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.3/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
12: ordinal not in range(128)
>>>
If you get this error, it usually means that you’re not reading the
file in the correct encoding. You should carefully read the
specification of whatever it is that you’re reading and check that
you’re doing it right (e.g., reading data as UTF-8 instead of Latin-1
or whatever it needs to be). If encoding errors are still a possibility, you can
supply an optional errors argument to open() to deal with the
errors. Here are a few samples of common error handling schemes:
>>> # Replace bad chars with Unicode U+fffd replacement char
>>> f = open('sample.txt', 'rt', encoding='ascii', errors='replace')
>>> f.read()
'Spicy Jalape?o!'
>>> # Ignore bad chars entirely
>>> g = open('sample.txt', 'rt', encoding='ascii', errors='ignore')
>>> g.read()
'Spicy Jalapeo!'
>>>
If you’re constantly fiddling with the encoding and errors arguments to open() and doing lots of
hacks, you’re probably making life more difficult than it needs to be.
The number one rule with text is that you simply need to make sure
you’re always using the proper text encoding. When in doubt, use the
default setting (typically UTF-8).
Use the file keyword argument to print(), like this:
with open('somefile.txt', 'wt') as f:
    print('hello world', file=f)
There’s not much more to printing to a file other than this. However, make sure that the file is opened in text mode. Printing will fail if the underlying file is in binary mode.
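The text-mode requirement can be seen without touching the disk. In this sketch, io.StringIO and io.BytesIO are used as in-memory stand-ins for files opened in text and binary mode; that substitution is an assumption made to keep the example self-contained.

```python
import io

text_file = io.StringIO()      # stands in for a file opened with mode 'wt'
binary_file = io.BytesIO()     # stands in for a file opened with mode 'wb'

# Works: print() writes str data to a text-mode file
print('hello world', file=text_file)

# Fails: a binary-mode file only accepts bytes-like objects
try:
    print('hello world', file=binary_file)
    failed = False
except TypeError as e:
    failed = True
    print('print() failed:', e)
```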
You want to output data using print(), but you also want to change the
separator character or line ending.
Use the sep and end keyword arguments to print() to change
the output as you wish. For example:
>>> print('ACME', 50, 91.5)
ACME 50 91.5
>>> print('ACME', 50, 91.5, sep=',')
ACME,50,91.5
>>> print('ACME', 50, 91.5, sep=',', end='!!\n')
ACME,50,91.5!!
>>>
Use of the end argument is also how you suppress the output of newlines in
output. For example:
>>> for i in range(5):
...     print(i)
...
0
1
2
3
4
>>> for i in range(5):
...     print(i, end=' ')
...
0 1 2 3 4 >>>
Using print() with a different item separator is often the easiest way to
output data when you need something other than a space separating the items.
Sometimes you’ll see programmers using str.join() to accomplish the same
thing. For example:
>>> print(','.join(('ACME', '50', '91.5')))
ACME,50,91.5
>>>
The problem with str.join() is that it only works with strings. This means
that it’s often necessary to perform various acrobatics to get it to work.
For example:
>>> row = ('ACME', 50, 91.5)
>>> print(','.join(row))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence item 1: expected str instance, int found
>>> print(','.join(str(x) for x in row))
ACME,50,91.5
>>>
Instead of doing that, you could just write the following:
>>> print(*row, sep=',')
ACME,50,91.5
>>>
Use the open() function with mode rb or wb to read or write binary data.
For example:
# Read the entire file as a single byte string
with open('somefile.bin', 'rb') as f:
    data = f.read()

# Write binary data to a file
with open('somefile.bin', 'wb') as f:
    f.write(b'Hello World')
When reading binary, it is important to stress that all data returned
will be in the form of byte strings, not text strings. Similarly,
when writing, you must supply data in the form of objects that expose
data as bytes (e.g., byte strings, bytearray objects, etc.).
When reading binary data, the subtle semantic differences between byte strings and text strings pose a potential gotcha. In particular, be aware that indexing and iteration return integer byte values instead of byte strings. For example:
>>> # Text string
>>> t = 'Hello World'
>>> t[0]
'H'
>>> for c in t:
...     print(c)
...
H
e
l
l
o
...
>>> # Byte string
>>> b = b'Hello World'
>>> b[0]
72
>>> for c in b:
...     print(c)
...
72
101
108
108
111
...
>>>
If you ever need to read or write text from a binary-mode file, make sure you remember to decode or encode it. For example:
with open('somefile.bin', 'rb') as f:
    data = f.read(16)
    text = data.decode('utf-8')

with open('somefile.bin', 'wb') as f:
    text = 'Hello World'
    f.write(text.encode('utf-8'))
A lesser-known aspect of binary I/O is that objects such as arrays and C structures can
be used for writing without any kind of intermediate conversion to a bytes object. For example:
import array
nums = array.array('i', [1, 2, 3, 4])
with open('data.bin', 'wb') as f:
    f.write(nums)
This applies to any object that implements the so-called “buffer interface,” which directly exposes an underlying memory buffer to operations that can work with it. Writing binary data is one such operation.
Many objects also allow binary data to be directly read into their underlying memory using the
readinto() method of files. For example:
>>> import array
>>> a = array.array('i', [0, 0, 0, 0, 0, 0, 0, 0])
>>> with open('data.bin', 'rb') as f:
...     f.readinto(a)
...
16
>>> a
array('i', [1, 2, 3, 4, 0, 0, 0, 0])
>>>
However, great care should be taken when using this technique, as it is often platform specific and may depend on such things as the word size and byte ordering (i.e., big endian versus little endian). See Recipe 5.9 for another example of reading binary data into a mutable buffer.
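One possible way to sidestep the word-size and byte-ordering concerns is to fix the layout explicitly with the struct module instead of writing the array's native memory. This is a sketch of an alternative, not the technique used above; the io.BytesIO object is an in-memory stand-in for a file opened in binary mode.

```python
import io
import struct

nums = [1, 2, 3, 4]

# '<' forces little-endian byte order with standard sizes,
# so each 'i' is always 4 bytes regardless of platform
fmt = '<%di' % len(nums)
data = struct.pack(fmt, *nums)

f = io.BytesIO()    # stand-in for open('data.bin', 'wb')
f.write(data)

# Reading back with the same format string recovers the values
f.seek(0)
result = list(struct.unpack(fmt, f.read()))
print(result)
```

The trade-off is an extra conversion step, which the buffer-based approach avoids.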
This problem is easily solved by using the little-known x mode to open()
instead of the usual w mode. For example:
>>> with open('somefile', 'wt') as f:
...     f.write('Hello\n')
...
>>> with open('somefile', 'xt') as f:
...     f.write('Hello\n')
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
FileExistsError: [Errno 17] File exists: 'somefile'
>>>
If the file is binary mode, use mode xb instead of xt.
This recipe illustrates an extremely elegant solution to a problem that sometimes arises when writing files (i.e., accidentally overwriting an existing file). An alternative solution is to first test for the file like this:
>>> import os
>>> if not os.path.exists('somefile'):
...     with open('somefile', 'wt') as f:
...         f.write('Hello\n')
... else:
...     print('File already exists!')
...
File already exists!
>>>
Clearly, using the x file mode is a lot more straightforward. It is
important to note that the x mode is a Python 3 specific extension
to the open() function. In particular, no such mode exists in
earlier Python versions or the underlying C libraries used in
Python’s implementation.
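In a script, the x mode is typically paired with a handler for FileExistsError. The following is a minimal sketch; the helper name write_new_file() and the temporary directory are illustrative choices, not part of the recipe.

```python
import os
import tempfile

def write_new_file(path, text):
    """Write text to path, refusing to overwrite. Return True on success."""
    try:
        with open(path, 'xt') as f:
            f.write(text)
        return True
    except FileExistsError:
        return False

with tempfile.TemporaryDirectory() as tmpdir:
    target = os.path.join(tmpdir, 'somefile')
    first = write_new_file(target, 'Hello\n')      # file is created
    second = write_new_file(target, 'Goodbye\n')   # refused, file exists
    with open(target, 'rt') as f:
        contents = f.read()

print(first, second, repr(contents))
```

Because the existence check and the creation happen inside a single open() call, there is no window between testing for the file and writing it, which the os.path.exists() approach cannot guarantee.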
You want to feed a text or binary string to code that’s been written to operate on file-like objects instead.
Use the io.StringIO() and io.BytesIO() classes to create
file-like objects that operate on string data. For example:
>>> import io
>>> s = io.StringIO()
>>> s.write('Hello World\n')
12
>>> print('This is a test', file=s)
>>> # Get all of the data written so far
>>> s.getvalue()
'Hello World\nThis is a test\n'
>>>

>>> # Wrap a file interface around an existing string
>>> s = io.StringIO('Hello\nWorld\n')
>>> s.read(4)
'Hell'
>>> s.read()
'o\nWorld\n'
>>>
The io.StringIO class should only be used for text. If you are
operating with binary data, use the io.BytesIO class instead. For
example:
>>> s = io.BytesIO()
>>> s.write(b'binary data')
>>> s.getvalue()
b'binary data'
>>>
The StringIO and BytesIO classes are most useful in scenarios
where you need to mimic a normal file for some reason. For example,
in unit tests, you might use StringIO to create a file-like object
containing test data that’s fed into a function that would otherwise
work with a normal file.
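As a sketch of the unit-testing scenario, suppose there is a function that expects an open text file. The function count_nonblank_lines() is a hypothetical example invented for illustration.

```python
import io

def count_nonblank_lines(f):
    """Count lines in a file-like object that contain more than whitespace."""
    return sum(1 for line in f if line.strip())

# Test data supplied through StringIO instead of a real file on disk
fake_file = io.StringIO('alpha\n\nbeta\n   \ngamma\n')
count = count_nonblank_lines(fake_file)
print(count)
```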
Be aware that StringIO and BytesIO instances don’t have a proper
integer file-descriptor. Thus, they do not work with code that
requires the use of a real system-level file such as a file, pipe, or
socket.
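The missing file descriptor is easy to observe directly. This short sketch asks a StringIO instance for its descriptor and catches the resulting exception.

```python
import io

s = io.StringIO('Hello\n')
try:
    s.fileno()
    has_descriptor = True
except io.UnsupportedOperation:
    # In-memory files have no operating system file descriptor
    has_descriptor = False

print(has_descriptor)
```

Code that depends on fileno(), such as anything that hands the descriptor to the operating system, needs a real file, pipe, or socket instead.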
The gzip and bz2 modules make it easy to work with such files.
Both modules provide an alternative implementation of open() that can
be used for this purpose. For example, to read compressed files as
text, do this:
# gzip compression
import gzip
with gzip.open('somefile.gz', 'rt') as f:
    text = f.read()

# bz2 compression
import bz2
with bz2.open('somefile.bz2', 'rt') as f:
    text = f.read()
Similarly, to write compressed data, do this:
# gzip compression
import gzip
with gzip.open('somefile.gz', 'wt') as f:
    f.write(text)

# bz2 compression
import bz2
with bz2.open('somefile.bz2', 'wt') as f:
    f.write(text)
As shown, all I/O will use text and perform Unicode encoding/decoding. If you want to
work with binary data instead, use a file mode of rb or wb.
For the most part, reading or writing compressed data is straightforward. However, be
aware that choosing the correct file mode is critically important.
If you don’t specify a mode, the default mode is binary, which will break
programs that expect to receive text. Both gzip.open() and bz2.open()
accept the same parameters as the built-in open() function, including
encoding, errors, newline, and so forth.
When writing compressed data, the compression level can be optionally specified
using the compresslevel keyword argument. For example:
with gzip.open('somefile.gz', 'wt', compresslevel=5) as f:
    f.write(text)
The default level is 9, which provides the highest level of compression. Lower levels offer better performance, but not as much compression.
Finally, a little-known feature of gzip.open() and bz2.open() is that
they can be layered on top of an existing file opened in binary mode.
For example, this works:
import gzip

f = open('somefile.gz', 'rb')
with gzip.open(f, 'rt') as g:
    text = g.read()
This allows the gzip and bz2 modules to work with various
file-like objects such as sockets, pipes, and in-memory files.
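The layering also works with in-memory files. The sketch below compresses text into an io.BytesIO buffer and reads it back; the sample text and the explicit utf-8 encoding are arbitrary choices for the illustration.

```python
import gzip
import io

text = 'Hello World\n' * 3

# Compress into an in-memory binary file
buf = io.BytesIO()
with gzip.open(buf, 'wt', encoding='utf-8') as g:
    g.write(text)

# The buffer stays open after the gzip layer is closed
compressed = buf.getvalue()

# Layer gzip on top of another in-memory file to decompress
with gzip.open(io.BytesIO(compressed), 'rt', encoding='utf-8') as g:
    restored = g.read()

print(restored == text)
```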
Instead of iterating over a file by lines, you want to iterate over a collection of fixed-sized records or chunks.
Use the iter() function and functools.partial() using this neat trick:
from functools import partial

RECORD_SIZE = 32

with open('somefile.data', 'rb') as f:
    records = iter(partial(f.read, RECORD_SIZE), b'')
    for r in records:
        ...
The records object in this example is an iterable that will
produce fixed-sized chunks until the end of the file is reached.
However, be aware that the last item may have fewer bytes than
expected if the file size is not an exact multiple of the record
size.
A little-known feature of the iter() function is that it can
create an iterator if you pass it a callable and a sentinel value.
The resulting iterator simply calls the supplied callable
over and over again until it returns the sentinel, at which point
iteration stops.
In the solution, the functools.partial is used to create a
callable that reads a fixed number of bytes from a file each
time it’s called. The sentinel of b'' is what gets returned
when a file is read but the end of file has been reached.
Last, but not least, the solution shows the file being opened in binary mode. For reading fixed-sized records, this would probably be the most common case. For text files, reading line by line (the default iteration behavior) is more common.
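The behavior of the sentinel and of the short final record can be observed with an in-memory file. In this sketch, a 70-byte io.BytesIO object stands in for the data file; the size is chosen so that it is not a multiple of the record size.

```python
from functools import partial
import io

RECORD_SIZE = 32

# 70 zero bytes: two full records plus a 6-byte remainder
f = io.BytesIO(bytes(70))

# f.read(32) is called repeatedly until it returns b'' at end of file
records = iter(partial(f.read, RECORD_SIZE), b'')
sizes = [len(r) for r in records]
print(sizes)    # [32, 32, 6]
```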
You want to read binary data directly into a mutable buffer without any intermediate copying. Perhaps you want to mutate the data in-place and write it back out to a file.
To read data into a mutable array, use the readinto() method of files.
For example:
import os.path

def read_into_buffer(filename):
    buf = bytearray(os.path.getsize(filename))
    with open(filename, 'rb') as f:
        f.readinto(buf)
    return buf
Here is an example that illustrates the usage:
>>> # Write a sample file
>>> with open('sample.bin', 'wb') as f:
...     f.write(b'Hello World')
...
>>> buf = read_into_buffer('sample.bin')
>>> buf
bytearray(b'Hello World')
>>> buf[0:5] = b'Hallo'
>>> buf
bytearray(b'Hallo World')
>>> with open('newsample.bin', 'wb') as f:
...     f.write(buf)
...
11
>>>
The readinto() method of files can be used to fill any preallocated
array with data. This even includes arrays created from the array module
or libraries such as numpy. Unlike the normal read() method, readinto()
fills the contents of an existing buffer rather than allocating new objects
and returning them. Thus, you might be able to use it to avoid making
extra memory allocations. For example, if you are reading a binary
file consisting of equally sized records, you can write code like this:
record_size = 32           # Size of each record (adjust value)

buf = bytearray(record_size)
with open('somefile', 'rb') as f:
    while True:
        n = f.readinto(buf)
        if n < record_size:
            break
        # Use the contents of buf
        ...
Another interesting feature to use here might be a memoryview, which lets you make zero-copy slices of an existing buffer and even change its contents. For example:
>>> buf
bytearray(b'Hello World')
>>> m1 = memoryview(buf)
>>> m2 = m1[-5:]
>>> m2
<memory at 0x100681390>
>>> m2[:] = b'WORLD'
>>> buf
bytearray(b'Hello WORLD')
>>>
One caution with using f.readinto() is that you must always make sure to check its
return value, which is the number of bytes actually read.
If the number of bytes is smaller than the size of the supplied buffer, it might indicate truncated or corrupted data (e.g., if you were expecting an exact number of bytes to be read).
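One way to act on that caution is to separate a clean end of file from a partial record. The helper read_record() below is a hypothetical sketch, assuming an ordinary blocking file; an in-memory io.BytesIO object stands in for the data file.

```python
import io

def read_record(f, buf):
    """Fill buf from f. Return False at a clean end of file,
    raise ValueError if only part of a record could be read."""
    n = f.readinto(buf)
    if n == 0:
        return False
    if n < len(buf):
        raise ValueError('truncated record: got %d of %d bytes' % (n, len(buf)))
    return True

buf = bytearray(4)
f = io.BytesIO(b'AAAABBBBCC')     # two full records and a 2-byte fragment
records = []
try:
    while read_record(f, buf):
        records.append(bytes(buf))   # copy, since buf is reused
    truncated = False
except ValueError:
    truncated = True

print(records, truncated)
```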
Finally, be on the lookout for other “into” related functions in
various library modules (e.g., recv_into(), pack_into(), etc.).
Many other parts of Python have support for direct I/O or data access
that can be used to fill or alter the contents of arrays and buffers.
See Recipe 6.12 for a significantly more advanced example of interpreting binary structures and usage of memoryviews.
You want to memory map a binary file into a mutable byte array, possibly for random access to its contents or to make in-place modifications.
Use the mmap module to memory map files. Here is a utility function
that shows how to open a file and memory map it in a portable manner:
import os
import mmap

def memory_map(filename, access=mmap.ACCESS_WRITE):
    size = os.path.getsize(filename)
    fd = os.open(filename, os.O_RDWR)
    return mmap.mmap(fd, size, access=access)
To use this function, you would need to have a file already created and filled with data. Here is an example of how you could initially create a file and expand it to a desired size:
>>> size = 1000000
>>> with open('data', 'wb') as f:
...     f.seek(size-1)
...     f.write(b'\x00')
...
>>>
Now here is an example of memory mapping the contents using
the memory_map() function:
>>> m = memory_map('data')
>>> len(m)
1000000
>>> m[0:10]
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> m[0]
0
>>> # Reassign a slice
>>> m[0:11] = b'Hello World'
>>> m.close()

>>> # Verify that changes were made
>>> with open('data', 'rb') as f:
...     print(f.read(11))
...
b'Hello World'
>>>
The mmap object returned by mmap() can also be used as a context manager, in
which case the underlying file is closed automatically. For example:
>>> with memory_map('data') as m:
...     print(len(m))
...     print(m[0:10])
...
1000000
b'Hello Worl'
>>> m.closed
True
>>>
By default, the memory_map() function shown opens a file for both reading
and writing. Any modifications made to the data are copied back to the
original file. If read-only access is needed instead, supply mmap.ACCESS_READ
for the access argument. For example:
m = memory_map(filename, mmap.ACCESS_READ)
If you intend to modify the data locally, but don’t want those changes
written back to the original file, use mmap.ACCESS_COPY:
m = memory_map(filename, mmap.ACCESS_COPY)
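The effect of mmap.ACCESS_COPY can be verified by modifying the mapping and then rereading the file. This sketch calls mmap.mmap() directly on a small temporary file rather than going through memory_map(); the file name and contents are illustrative.

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'data')
    with open(path, 'wb') as f:
        f.write(b'Hello World')

    with open(path, 'r+b') as f:
        # A length of 0 maps the whole file
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as m:
            m[0:5] = b'Hallo'            # changes only the private copy
            in_memory = bytes(m[0:11])

    # The file on disk is untouched
    with open(path, 'rb') as f:
        on_disk = f.read()

print(in_memory, on_disk)
```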
Using mmap to map files into memory can be an efficient and
elegant means for randomly accessing the contents of a file. For
example, instead of opening a file and performing various combinations
of seek(), read(), and write() calls, you can simply map the file
and access the data using slicing operations.
Normally, the memory exposed by mmap() looks like a bytearray object.
However, you can interpret the data differently using a memoryview. For
example:
>>> m = memory_map('data')
>>> # Memoryview of unsigned integers
>>> v = memoryview(m).cast('I')
>>> v[0] = 7
>>> m[0:4]
b'\x07\x00\x00\x00'
>>> m[0:4] = b'\x07\x01\x00\x00'
>>> v[0]
263
>>>
It should be emphasized that memory mapping a file does not cause the entire file to be read into memory. That is, it’s not copied into some kind of memory buffer or array. Instead, the operating system merely reserves a section of virtual memory for the file contents. As you access different regions, those portions of the file will be read and mapped into the memory region as needed. However, parts of the file that are never accessed simply stay on disk. This all happens transparently, behind the scenes.
If more than one Python interpreter memory maps the same file, the resulting
mmap object can be used to exchange data between interpreters. That is,
all interpreters can read/write data simultaneously, and changes made to
the data in one interpreter will automatically appear in the others.
Obviously, some extra care is required to synchronize things, but this
kind of approach is sometimes used as an alternative to transmitting data
in messages over pipes or sockets.
As shown, this recipe has been written to be as general purpose as
possible, working on both Unix and Windows. Be aware that there are
some platform differences concerning the use of the mmap() call
hidden behind the scenes. In addition, there are options to create
anonymously mapped memory regions. If this is of interest to you,
make sure you carefully read the Python documentation
on the subject.
You need to manipulate pathnames in order to find the base filename, directory name, absolute path, and so on.
To manipulate pathnames, use the functions in the os.path module.
Here is an interactive example that illustrates a few key features:
>>> import os
>>> path = '/Users/beazley/Data/data.csv'

>>> # Get the last component of the path
>>> os.path.basename(path)
'data.csv'

>>> # Get the directory name
>>> os.path.dirname(path)
'/Users/beazley/Data'

>>> # Join path components together
>>> os.path.join('tmp', 'data', os.path.basename(path))
'tmp/data/data.csv'

>>> # Expand the user's home directory
>>> path = '~/Data/data.csv'
>>> os.path.expanduser(path)
'/Users/beazley/Data/data.csv'

>>> # Split the file extension
>>> os.path.splitext(path)
('~/Data/data', '.csv')
>>>
For any manipulation of filenames, you should use the os.path module
instead of trying to cook up your own code using the standard string
operations. In part, this is for portability. The os.path module
knows about differences between Unix and Windows and can reliably
deal with filenames such as Data/data.csv and Data\data.csv.
Second, you really shouldn’t spend your time reinventing the wheel.
It’s usually best to use the functionality that’s already provided for you.
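The platform differences can be seen on a single machine, because os.path is an alias for either the posixpath or the ntpath module depending on where Python is running. This sketch imports both explicitly, which is done here only for demonstration; normal code should simply use os.path.

```python
import ntpath       # the Windows flavor of os.path
import posixpath    # the Unix flavor of os.path

# Each flavor splits on its own separator conventions
unix_name = posixpath.basename('Data/data.csv')
win_name = ntpath.basename('Data\\data.csv')

# Each flavor joins with its own separator
unix_joined = posixpath.join('Data', 'data.csv')
win_joined = ntpath.join('Data', 'data.csv')

# A backslash is just an ordinary character in a Unix filename
not_split = posixpath.basename('Data\\data.csv')

print(unix_name, win_name, unix_joined, win_joined, not_split)
```

Hand-written string splitting on '/' would get the Windows cases wrong, which is the portability argument made above.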
It should be noted that the os.path module has many more features not
shown in this recipe. Consult the documentation for more functions
related to file testing, symbolic links, and so forth.
Use the os.path module to test for the existence of a file or
directory. For example:
>>> import os
>>> os.path.exists('/etc/passwd')
True
>>> os.path.exists('/tmp/spam')
False
>>>
You can perform further tests to see what kind of file it might be. These tests return False if the file in question doesn’t
exist:
>>> # Is a regular file
>>> os.path.isfile('/etc/passwd')
True

>>> # Is a directory
>>> os.path.isdir('/etc/passwd')
False

>>> # Is a symbolic link
>>> os.path.islink('/usr/local/bin/python3')
True

>>> # Get the file linked to
>>> os.path.realpath('/usr/local/bin/python3')
'/usr/local/bin/python3.3'
>>>
If you need to get metadata (e.g., the file size or modification
date), that is also available in the os.path module.
>>> os.path.getsize('/etc/passwd')
3669
>>> os.path.getmtime('/etc/passwd')
1272478234.0
>>> import time
>>> time.ctime(os.path.getmtime('/etc/passwd'))
'Wed Apr 28 13:10:34 2010'
>>>
File testing is a straightforward operation using os.path.
Probably the only thing to be aware of when writing scripts is that
you might need to worry about permissions—especially for
operations that get metadata. For example:
>>> os.path.getsize('/Users/guido/Desktop/foo.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.3/genericpath.py", line 49, in getsize
    return os.stat(filename).st_size
PermissionError: [Errno 13] Permission denied: '/Users/guido/Desktop/foo.txt'
>>>
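A script that must keep running when metadata is unavailable can catch the error. The helper safe_getsize() is a hypothetical sketch, and the path used to exercise it is assumed not to exist.

```python
import os.path

def safe_getsize(path):
    """Return the file size in bytes, or None if it can't be determined."""
    try:
        return os.path.getsize(path)
    except OSError:
        # OSError covers PermissionError and FileNotFoundError alike
        return None

missing = safe_getsize('/no/such/directory/foo.txt')
print(missing)
```

Whether returning None is appropriate depends on the script; in some programs it is better to let the exception propagate so the problem is not silently ignored.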
Use the os.listdir() function to obtain a list of files in a directory:
import os
names = os.listdir('somedir')
This will give you the raw directory listing, including all files, subdirectories,
symbolic links, and so forth. If you need to filter the data in some way, consider
using a list comprehension combined with various functions in the os.path library.
For example:
import os.path

# Get all regular files
names = [name for name in os.listdir('somedir')
         if os.path.isfile(os.path.join('somedir', name))]

# Get all dirs
dirnames = [name for name in os.listdir('somedir')
            if os.path.isdir(os.path.join('somedir', name))]
The startswith() and endswith() methods of strings can be useful
for filtering the contents of a directory as well. For example:
pyfiles = [name for name in os.listdir('somedir')
           if name.endswith('.py')]
For filename matching, you may want to use the glob or fnmatch modules instead.
For example:
import glob
pyfiles = glob.glob('somedir/*.py')

from fnmatch import fnmatch
pyfiles = [name for name in os.listdir('somedir')
           if fnmatch(name, '*.py')]
Getting a directory listing is easy, but it only gives you the names of
entries in the directory. If you want to get additional metadata, such as
file sizes, modification dates, and so forth, you either need to use
additional functions in the os.path module or use the os.stat() function
to collect the data. For example:
# Example of getting a directory listing

import os
import os.path
import glob

pyfiles = glob.glob('*.py')

# Get file sizes and modification dates
name_sz_date = [(name, os.path.getsize(name), os.path.getmtime(name))
                for name in pyfiles]
for name, size, mtime in name_sz_date:
    print(name, size, mtime)

# Alternative: Get file metadata
file_metadata = [(name, os.stat(name)) for name in pyfiles]
for name, meta in file_metadata:
    print(name, meta.st_size, meta.st_mtime)
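On newer Python versions, os.scandir() offers another way to collect names and metadata in one pass. It is not used in this recipe and requires Python 3.5 or later (3.6 for use in a with statement), so treat the following as a sketch under that assumption. The helper list_files() and the temporary directory contents are made up for the example.

```python
import os
import tempfile

def list_files(dirname):
    """Return sorted (name, size, mtime) tuples for regular files in dirname."""
    result = []
    with os.scandir(dirname) as entries:
        for entry in entries:
            if entry.is_file():
                info = entry.stat()
                result.append((entry.name, info.st_size, info.st_mtime))
    return sorted(result)

with tempfile.TemporaryDirectory() as tmpdir:
    # One regular file and one subdirectory
    with open(os.path.join(tmpdir, 'a.py'), 'wb') as f:
        f.write(b'print(1)\n')
    os.mkdir(os.path.join(tmpdir, 'subdir'))
    listing = list_files(tmpdir)

for name, size, mtime in listing:
    print(name, size, mtime)
```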
Last, but not least, be aware that there are subtle issues that
can arise in filename handling related to encodings. Normally,
the entries returned by a function such as os.listdir() are
decoded according to the system default filename encoding.
However, it’s possible under certain circumstances to encounter
un-decodable filenames. Recipes 5.14 and 5.15 have more details about
handling such names.
You want to perform file I/O operations using raw filenames that have not been decoded or encoded according to the default filename encoding.
By default, all filenames are encoded and decoded according to the
text encoding returned by sys.getfilesystemencoding(). For example:
>>> import sys
>>> sys.getfilesystemencoding()
'utf-8'
>>>
If you want to bypass this encoding for some reason, specify a filename using a raw byte string instead. For example:
>>> # Write a file using a unicode filename
>>> with open('jalape\xf1o.txt', 'w') as f:
...     f.write('Spicy!')
...
6
>>> # Directory listing (decoded)
>>> import os
>>> os.listdir('.')
['jalapeño.txt']

>>> # Directory listing (raw)
>>> os.listdir(b'.')        # Note: byte string
[b'jalapen\xcc\x83o.txt']

>>> # Open file with raw filename
>>> with open(b'jalapen\xcc\x83o.txt') as f:
...     print(f.read())
...
Spicy!
>>>
As you can see in the last two operations, the filename handling
changes ever so slightly when byte strings are supplied to
file-related functions, such as open() and os.listdir().
Under normal circumstances, you shouldn’t need to worry about filename encoding and decoding—normal filename operations should just work. However, many operating systems may allow a user through accident or malice to create files with names that don’t conform to the expected encoding rules. Such filenames may mysteriously break Python programs that work with a lot of files.
Reading directories and working with filenames as raw undecoded bytes has the potential to avoid such problems, albeit at the cost of programming convenience.
See Recipe 5.15 for a recipe on printing undecodable filenames.
Your program received a directory listing, but when it tried to print the
filenames, it crashed with a UnicodeEncodeError exception and a
cryptic message about “surrogates not allowed.”
When printing filenames of unknown origin, use this convention to avoid errors:
def bad_filename(filename):
    return repr(filename)[1:-1]

try:
    print(filename)
except UnicodeEncodeError:
    print(bad_filename(filename))
This recipe is about a potentially rare but very annoying problem regarding
programs that must manipulate the filesystem. By default, Python
assumes that all filenames are encoded according to the setting
reported by sys.getfilesystemencoding(). However, certain filesystems don’t necessarily enforce this encoding restriction, thereby allowing
files to be created without proper filename encoding. It’s
not common, but there is always the danger that some user will do
something silly and create such a file by accident (e.g., maybe
passing a bad filename to open() in some buggy code).
When executing a command such as os.listdir(), bad filenames leave
Python in a bind. On the one hand, it can’t just discard bad names.
On the other hand, it still can’t turn the filename into a proper
text string. Python’s solution to this problem is to take an undecodable
byte value \xhh in a filename and map it into a so-called “surrogate encoding” represented by the Unicode character \udchh.
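The mapping can be reproduced by hand with the surrogateescape error handler, which is the mechanism behind it. The sketch below assumes a filesystem encoding of utf-8 and uses a made-up Latin-1 encoded name.

```python
raw = b'b\xe4d.txt'     # "bäd.txt" encoded as Latin-1, not valid UTF-8

# The undecodable byte 0xe4 becomes the lone surrogate U+DCE4
name = raw.decode('utf-8', errors='surrogateescape')
print(ascii(name))      # ascii() escapes the surrogate so printing is safe

# Encoding with the same handler restores the original bytes,
# which is why such names still work when passed back to open()
restored = name.encode('utf-8', errors='surrogateescape')
print(restored == raw)
```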
Here is an example of how a bad directory listing might look
if it contained a filename bäd.txt, encoded as Latin-1 instead of UTF-8:
>>> import os
>>> files = os.listdir('.')
>>> files
['spam.py', 'b\udce4d.txt', 'foo.txt']
>>>
If you have code that manipulates filenames or even passes them to
functions such as open(), everything works normally. It’s only in
situations where you want to output the filename that you run into
trouble (e.g., printing it to the screen, logging it, etc.).
Specifically, if you try to print the preceding listing, your program will
crash:
>>> for name in files:
...     print(name)
...
spam.py
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce4' in
position 1: surrogates not allowed
>>>
The reason it crashes is that the character \udce4 is technically
invalid Unicode. It’s actually the second half of a two-character combination
known as a surrogate pair. However, since the first half is missing,
it’s invalid Unicode. Thus, the only way to produce successful output is
to take corrective action when a bad filename is encountered. For
example, changing the code to the recipe produces the following:
>>> for name in files:
...     try:
...         print(name)
...     except UnicodeEncodeError:
...         print(bad_filename(name))
...
spam.py
b\udce4d.txt
foo.txt
>>>
The choice of what to do for the bad_filename() function is largely
up to you. Another option is to re-encode the value in some
way, like this:
import sys

def bad_filename(filename):
    temp = filename.encode(sys.getfilesystemencoding(), errors='surrogateescape')
    return temp.decode('latin-1')
Using this version produces the following output:
>>> for name in files:
...     try:
...         print(name)
...     except UnicodeEncodeError:
...         print(bad_filename(name))
...
spam.py
bäd.txt
foo.txt
>>>
This recipe will likely be ignored by most readers. However, if you’re writing mission-critical scripts that need to work reliably with filenames and the filesystem, it’s something to think about. Otherwise, you might find yourself called back into the office over the weekend to debug a seemingly inscrutable error.
You want to add or change the Unicode encoding of an already open file without closing it first.
If you want to add Unicode encoding/decoding to an already existing
file object that’s opened in binary mode, wrap it with an
io.TextIOWrapper() object. For example:
import urllib.request
import io

u = urllib.request.urlopen('http://www.python.org')
f = io.TextIOWrapper(u, encoding='utf-8')
text = f.read()
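The same wrapping can be tried without a network connection. In this sketch, an io.BytesIO object holding UTF-8 encoded sample data stands in for the binary-mode file object.

```python
import io

# A binary-mode file-like object containing UTF-8 encoded text
binary = io.BytesIO('Jalape\u00f1o\nSpicy!\n'.encode('utf-8'))

# Add a text decoding layer on top of it
f = io.TextIOWrapper(binary, encoding='utf-8')
lines = f.read().splitlines()
print(lines)
```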
If you want to change the encoding of an already open text-mode file,
use its detach() method to remove the existing text encoding layer
before replacing it with a new one. Here is an example of changing the encoding
on sys.stdout:
>>> import sys
>>> import io
>>> sys.stdout.encoding
'UTF-8'
>>> sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding='latin-1')
>>> sys.stdout.encoding
'latin-1'
>>>
Doing this might break the output of your terminal. It’s only meant to illustrate.
The I/O system is built as a series of layers. You can see the layers yourself by trying this simple example involving a text file:
>>> f = open('sample.txt', 'w')
>>> f
<_io.TextIOWrapper name='sample.txt' mode='w' encoding='UTF-8'>
>>> f.buffer
<_io.BufferedWriter name='sample.txt'>
>>> f.buffer.raw
<_io.FileIO name='sample.txt' mode='wb'>
>>>
In this example, io.TextIOWrapper is a text-handling layer that encodes
and decodes Unicode, io.BufferedWriter is a buffered I/O layer that handles
binary data, and io.FileIO is a raw file representing the low-level file
descriptor in the operating system. Adding or changing the text
encoding involves adding or changing the topmost io.TextIOWrapper layer.
As a general rule, it’s not safe to directly manipulate the different layers by accessing the attributes shown. For example, see what happens if you try to change the encoding using this technique:
>>> f
<_io.TextIOWrapper name='sample.txt' mode='w' encoding='UTF-8'>
>>> f = io.TextIOWrapper(f.buffer, encoding='latin-1')
>>> f
<_io.TextIOWrapper name='sample.txt' encoding='latin-1'>
>>> f.write('Hello')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: I/O operation on closed file.
>>>
It doesn’t work because rebinding f destroyed the original TextIOWrapper object, which closed the underlying file in the process.
The detach() method disconnects the topmost layer of a file and returns
the next lower layer. Afterward, the top layer will no longer be usable.
For example:
>>> f = open('sample.txt', 'w')
>>> f
<_io.TextIOWrapper name='sample.txt' mode='w' encoding='UTF-8'>
>>> b = f.detach()
>>> b
<_io.BufferedWriter name='sample.txt'>
>>> f.write('hello')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: underlying buffer has been detached
>>>
Once detached, however, you can add a new top layer to the returned result. For example:
>>> f = io.TextIOWrapper(b, encoding='latin-1')
>>> f
<_io.TextIOWrapper name='sample.txt' encoding='latin-1'>
>>>
Although changing the encoding has been shown, it is also possible to use this technique to change the line handling, error policy, and other aspects of file handling. For example:
>>> sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding='ascii',
...                               errors='xmlcharrefreplace')
>>> print('Jalape\u00f1o')
Jalape&#241;o
>>>
Notice how the non-ASCII character ñ has been replaced by &#241; in the output.
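If you would rather experiment without disturbing your terminal, the same steps can be tried on an in-memory stream. The following is only a sketch: io.BytesIO stands in for a real binary-mode file, and newline='\n' is passed so that the bytes written are the same on every platform:

```python
import io

raw = io.BytesIO()   # in-memory stand-in for a binary-mode file

# First text layer: UTF-8
f = io.TextIOWrapper(raw, encoding='utf-8', newline='\n')
f.write('Jalape\u00f1o\n')
f.flush()            # push pending text down before removing the layer
b = f.detach()       # f is unusable from here on

try:
    f.write('more')
    detached_error = None
except ValueError as e:
    detached_error = str(e)

# Second text layer on the same stream: ASCII with a different error policy
g = io.TextIOWrapper(b, encoding='ascii', errors='xmlcharrefreplace',
                     newline='\n')
g.write('Jalape\u00f1o\n')
g.flush()

result = raw.getvalue()
```

The first line of result is UTF-8 encoded, whereas the second contains the character reference produced by the xmlcharrefreplace policy.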
To write raw bytes to a file that is open in text mode, simply write the
byte data to the file’s underlying buffer. For example:
>>> import sys
>>> sys.stdout.write(b'Hello\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be str, not bytes
>>> sys.stdout.buffer.write(b'Hello\n')
Hello
6
>>>
Similarly, binary data can be read from a text file by
reading from its buffer attribute instead.
The I/O system is built from layers. Text files are constructed by
adding a Unicode encoding/decoding layer on top of a buffered binary-mode
file. The buffer attribute simply points at this underlying file.
If you access it, you’ll bypass the text encoding/decoding layer.
The example involving sys.stdout might be viewed as a special case. By default, sys.stdout is always opened in text mode.
However, if you are writing a script that actually needs to dump
binary data to standard output, you can use the technique shown to
bypass the text encoding.
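The same idea applies to an ordinary text file. Here is a small sketch that mixes text and raw bytes in one file; the filename mixed.bin is made up for illustration. Note the call to flush() before touching the buffer: the text layer keeps its own pending data, so flushing first keeps the output in the order you wrote it.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as dirname:
    path = os.path.join(dirname, 'mixed.bin')

    with open(path, 'wt', encoding='utf-8', newline='\n') as f:
        f.write('header\n')               # goes through the text layer
        f.flush()                         # push pending text down first
        f.buffer.write(b'\x00\x01\xff')   # raw bytes, no encoding applied

    # Read everything back in binary mode to inspect the result
    with open(path, 'rb') as f:
        contents = f.read()
```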
You have an integer file descriptor corresponding to an already open I/O channel on the operating system (e.g., file, pipe, socket, etc.), and you want to wrap a higher-level Python file object around it.
A file descriptor is different than a normal open file in that it is
simply an integer handle assigned by the operating system to refer to
some kind of system I/O channel. If you happen to have such a file
descriptor, you can wrap a Python file object around it using the
open() function. However, you simply supply the integer file descriptor
as the first argument instead of the filename. For example:
# Open a low-level file descriptor
import os
fd = os.open('somefile.txt', os.O_WRONLY | os.O_CREAT)

# Turn into a proper file
f = open(fd, 'wt')
f.write('hello world\n')
f.close()
When the high-level file object is closed or destroyed, the underlying
file descriptor will also be closed. If this is not desired, supply
the optional closefd=False argument to open(). For example:
# Create a file object, but don't close underlying fd when done
f = open(fd, 'wt', closefd=False)
...
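To convince yourself that closefd=False really leaves the descriptor open, you might try a sketch like the following. It writes through the file object, closes it, and then keeps using the raw descriptor with os.write(). The temporary directory and the newline='\n' argument are only there to make the result predictable:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as dirname:
    path = os.path.join(dirname, 'somefile.txt')
    fd = os.open(path, os.O_WRONLY | os.O_CREAT)

    # Wrap the descriptor, but leave ownership of it with the caller
    f = open(fd, 'wt', newline='\n', closefd=False)
    f.write('hello world\n')
    f.close()                        # flushes the data; fd stays open

    os.write(fd, b'still open\n')    # the descriptor is still usable
    os.close(fd)                     # now it is your job to close it

    with open(path, 'rb') as check:
        data = check.read()
```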
On Unix systems, this technique of wrapping a file descriptor can be a convenient means for putting a file-like interface on an existing I/O channel that was opened in a different way (e.g., pipes, sockets, etc.). For instance, here is an example involving sockets:
from socket import socket, AF_INET, SOCK_STREAM

def echo_client(client_sock, addr):
    print('Got connection from', addr)

    # Make text-mode file wrappers for socket reading/writing
    client_in = open(client_sock.fileno(), 'rt', encoding='latin-1',
                     closefd=False)
    client_out = open(client_sock.fileno(), 'wt', encoding='latin-1',
                      closefd=False)

    # Echo lines back to the client using file I/O
    for line in client_in:
        client_out.write(line)
        client_out.flush()
    client_sock.close()

def echo_server(address):
    sock = socket(AF_INET, SOCK_STREAM)
    sock.bind(address)
    sock.listen(1)
    while True:
        client, addr = sock.accept()
        echo_client(client, addr)
It’s important to emphasize that the above example is only meant to
illustrate a feature of the built-in open() function and that it
only works on Unix-based systems. If you are trying to put a
file-like interface on a socket and need your code to be cross
platform, use the makefile() method of sockets instead. However,
if portability is not a concern, you’ll find that the above solution
provides much better performance than using makefile().
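For comparison, here is a rough sketch of the makefile() approach. To keep it self-contained, it uses socket.socketpair() to get two connected sockets in one process rather than a real client and server:

```python
import socket

# Two connected sockets in the same process (a stand-in for client/server)
a, b = socket.socketpair()

# Text-mode file wrappers created by the sockets themselves
out = a.makefile('w', encoding='latin-1', newline='\n')
inp = b.makefile('r', encoding='latin-1', newline='\n')

out.write('hello\n')
out.flush()              # make sure the line is actually sent
line = inp.readline()

# Close the wrappers first, then the sockets
out.close()
inp.close()
a.close()
b.close()
```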
You can also use this to make a kind of alias that allows an already
open file to be used in a slightly different way than how it was first
opened. For example, here’s how you could create a file object that
allows you to emit binary data on stdout (which is normally opened
in text mode):
import sys

# Create a binary-mode file for stdout
bstdout = open(sys.stdout.fileno(), 'wb', closefd=False)
bstdout.write(b'Hello World\n')
bstdout.flush()
Although it’s possible to wrap an existing file descriptor as a proper file, be aware that not all file modes may be supported and that certain kinds of file descriptors may have funny side effects (especially with respect to error handling, end-of-file conditions, etc.). The behavior can also vary according to operating system. In particular, none of the examples are likely to work on non-Unix systems. The bottom line is that you’ll need to thoroughly test your implementation to make sure it works as expected.
You need to create a temporary file or directory for use when your program executes. Afterward, you possibly want the file or directory to be destroyed.
The tempfile module has a variety of functions for performing this
task. To make an unnamed temporary file, use tempfile.TemporaryFile:
from tempfile import TemporaryFile

with TemporaryFile('w+t') as f:
    # Read/write to the file
    f.write('Hello World\n')
    f.write('Testing\n')

    # Seek back to beginning and read the data
    f.seek(0)
    data = f.read()

# Temporary file is destroyed
Or, if you prefer, you can also use the file like this:
f = TemporaryFile('w+t')
# Use the temporary file
...
f.close()
# File is destroyed
The first argument to TemporaryFile() is the file mode, which is
usually w+t for text and w+b for binary. This mode
simultaneously supports reading and writing, which is useful here
since closing the file to change modes would actually destroy it.
TemporaryFile() additionally accepts the same arguments as the
built-in open() function. For example:
with TemporaryFile('w+t', encoding='utf-8', errors='ignore') as f:
    ...
On most Unix systems, the file created by TemporaryFile() is unnamed
and won’t even have a directory entry. If you want to relax this
constraint, use NamedTemporaryFile() instead. For example:
from tempfile import NamedTemporaryFile

with NamedTemporaryFile('w+t') as f:
    print('filename is:', f.name)
    ...

# File automatically destroyed
Here, the f.name attribute of the opened file contains the filename
of the temporary file. This can be useful if it needs to be given to
some other code that needs to open the file. As with TemporaryFile(),
the resulting file is automatically deleted when it’s closed. If you
don’t want this, supply a delete=False keyword argument. For example:
with NamedTemporaryFile('w+t', delete=False) as f:
    print('filename is:', f.name)
    ...
To make a temporary directory, use tempfile.TemporaryDirectory().
For example:
from tempfile import TemporaryDirectory

with TemporaryDirectory() as dirname:
    print('dirname is:', dirname)
    # Use the directory
    ...

# Directory and all contents destroyed
The TemporaryFile(), NamedTemporaryFile(), and
TemporaryDirectory() functions are probably the most convenient way to
work with temporary files and directories, because they automatically
handle all of the steps of creation and subsequent cleanup. At
a lower level, you can also use the mkstemp() and mkdtemp() functions
to create temporary files and directories. For example:
>>> import tempfile
>>> tempfile.mkstemp()
(3, '/var/folders/7W/7WZl5sfZEF0pljrEB1UMWE+++TI/-Tmp-/tmp7fefhv')
>>> tempfile.mkdtemp()
'/var/folders/7W/7WZl5sfZEF0pljrEB1UMWE+++TI/-Tmp-/tmp5wvcv6'
>>>
However, these functions don’t really take care of further management.
For example, the mkstemp() function simply returns a raw OS file
descriptor and leaves it up to you to turn it into a proper file.
Similarly, it’s up to you to clean up the files if you want.
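As a rough sketch of what that extra management might look like, the following wraps the descriptor returned by mkstemp() with open() and removes the file and directory by hand afterward. The variable names are only illustrative:

```python
import os
import shutil
import tempfile

# mkstemp() returns a raw descriptor and a path; nothing is cleaned up for you
fd, path = tempfile.mkstemp(suffix='.txt')
try:
    # Closing f also closes fd (closefd defaults to True)
    with open(fd, 'w+t', encoding='utf-8') as f:
        f.write('Hello World\n')
        f.seek(0)
        data = f.read()
finally:
    os.remove(path)              # explicit cleanup of the file

file_exists = os.path.exists(path)

# mkdtemp() likewise leaves removal of the directory up to you
dirname = tempfile.mkdtemp()
shutil.rmtree(dirname)
dir_exists = os.path.exists(dirname)
```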
Normally, temporary files are created in the system’s default location,
such as /var/tmp or similar. To find out the actual location,
use the tempfile.gettempdir() function. For example:
>>> tempfile.gettempdir()
'/var/folders/7W/7WZl5sfZEF0pljrEB1UMWE+++TI/-Tmp-'
>>>
All of the temporary-file-related functions allow you to override this
directory as well as the naming conventions using the prefix, suffix,
and dir keyword arguments. For example:
>>> f = NamedTemporaryFile(prefix='mytemp', suffix='.txt', dir='/tmp')
>>> f.name
'/tmp/mytemp8ee899.txt'
>>>
Last, but not least, to the extent possible, the tempfile module
creates temporary files in the most secure manner possible. This
includes only giving access permission to the current user and taking
steps to avoid race conditions in file creation. Be aware that
there can be differences between platforms. Thus, you should make sure
to check the official documentation for the finer points.
You want to read and write data over a serial port, typically to interact with some kind of hardware device (e.g., a robot or sensor).
Although you can probably do this directly using Python’s built-in I/O primitives, your best bet for serial communication is to use the pySerial package. Getting started with the package is very easy. You simply open up a serial port using code like this:
import serial
ser = serial.Serial('/dev/tty.usbmodem641',  # Device name varies
                    baudrate=9600,
                    bytesize=8,
                    parity='N',
                    stopbits=1)
The device name will vary according to the kind of device and operating system. For instance, on
Windows, communication ports are opened by name, such as 'COM1' or 'COM3'. Older pySerial releases also accepted a device number such as 0 or 1, but pySerial 3.0 and later expect a device name.
Once open, you can read and write data using read(), readline(), and write() calls. For example:
ser.write(b'G1 X50 Y50\r\n')
resp = ser.readline()
For the most part, basic serial communication should be straightforward from this point forward.
Although simple on the surface, serial communication can sometimes get
rather messy. One reason you should use a package such as pySerial is that
it provides support for advanced features (e.g., timeouts, flow control,
buffer flushing, handshaking, etc.). For instance, if you want to
enable RTS-CTS handshaking, you simply provide a rtscts=True
argument to Serial(). The provided documentation is excellent, so there’s
little benefit to paraphrasing it here.
Keep in mind that all I/O involving serial ports is binary. Thus, make
sure you write your code to use bytes instead of text (or perform
proper text encoding/decoding as needed). The struct module may also be
useful should you need to create binary-coded commands or packets.
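As an illustration of the last point, here is a sketch that builds and parses a binary command with struct. The packet layout (a start byte, a command code, and two 16-bit coordinates in little-endian order) is entirely made up; a real device will define its own format:

```python
import struct

# Hypothetical layout: start byte, command id, 16-bit X, 16-bit Y (little-endian)
PACKET = struct.Struct('<BBHH')

START_BYTE = 0x7E   # assumed framing byte
CMD_MOVE = 0x01     # assumed command code

def make_move_packet(x, y):
    """Pack a 'move' command into the 6-byte hypothetical format."""
    return PACKET.pack(START_BYTE, CMD_MOVE, x, y)

def parse_packet(data):
    """Unpack a packet and return (command, x, y)."""
    start, cmd, x, y = PACKET.unpack(data)
    if start != START_BYTE:
        raise ValueError('bad start byte')
    return cmd, x, y

packet = make_move_packet(50, 50)

# Text commands must be encoded to bytes before being written to the port
text_command = 'G1 X50 Y50\r\n'.encode('ascii')
```

The resulting bytes objects can be handed directly to ser.write().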
You need to serialize a Python object into a byte stream so that you can do things such as save it to a file, store it in a database, or transmit it over a network connection.
The most common approach for serializing data is to use the pickle
module. To dump an object to a file, you do this:
import pickle

data = ...   # Some Python object
with open('somefile', 'wb') as f:
    pickle.dump(data, f)
To dump an object to a byte string, use pickle.dumps():
s = pickle.dumps(data)
To re-create an object from a byte stream, use either the pickle.load() or pickle.loads() functions.
For example:
# Restore from a file
with open('somefile', 'rb') as f:
    data = pickle.load(f)

# Restore from a byte string
data = pickle.loads(s)
For most programs, usage of the dump() and load() functions is all
you need to effectively use pickle. It simply works with most Python
data types and instances of user-defined classes. If you’re working
with any kind of library that lets you do things such as save/restore
Python objects in databases or transmit objects over the network,
there’s a pretty good chance that pickle is being used.
pickle is a Python-specific self-describing data encoding. Self-describing
means that the serialized data contains information related to
the start and end of each object as well as information about its
type. Thus, you don’t need to worry about defining records—it simply
works. For example, if working with multiple objects, you can do this:
>>> import pickle
>>> f = open('somedata', 'wb')
>>> pickle.dump([1, 2, 3, 4], f)
>>> pickle.dump('hello', f)
>>> pickle.dump({'Apple', 'Pear', 'Banana'}, f)
>>> f.close()
>>> f = open('somedata', 'rb')
>>> pickle.load(f)
[1, 2, 3, 4]
>>> pickle.load(f)
'hello'
>>> pickle.load(f)
{'Apple', 'Pear', 'Banana'}
>>>
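If you don’t know in advance how many objects a file holds, one option is to keep calling pickle.load() until it raises EOFError. The helper below, load_all(), is a hypothetical name used only for this sketch, and io.BytesIO stands in for a real file:

```python
import io
import pickle

def load_all(f):
    """Yield every pickled object in the binary file f, in order."""
    while True:
        try:
            yield pickle.load(f)
        except EOFError:      # raised when no more data is available
            return

buf = io.BytesIO()            # in-memory stand-in for a file opened in 'wb' mode
for obj in ([1, 2, 3, 4], 'hello', {'Apple', 'Pear', 'Banana'}):
    pickle.dump(obj, buf)

buf.seek(0)
objects = list(load_all(buf))
```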
You can pickle functions, classes, and instances, but the resulting data only encodes name references to the associated code objects. For example:
>>> import math
>>> import pickle
>>> pickle.dumps(math.cos)
b'\x80\x03cmath\ncos\nq\x00.'
>>>
When the data is unpickled, it is assumed that all of the required source is available. Modules, classes, and functions will automatically be imported as needed. For applications where Python data is being shared between interpreters on different machines, this is a potential maintenance issue, as all machines must have access to the same source code.
pickle.load() should never be used
on untrusted data. As a side effect of loading, pickle will
automatically load modules and make instances. However, an evildoer
who knows how pickle works can create “malformed” data that causes
Python to execute arbitrary system commands. Thus, it’s essential
that pickle only be used internally with interpreters that have some ability to authenticate one another.
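One way to get a basic level of authentication, sketched below, is to sign the pickled bytes with a shared secret using the hmac module and to verify the signature before unpickling. This is only an illustration under stated assumptions: the key, the function names, and the layout (a 32-byte SHA-256 digest followed by the payload) are all made up here, and it does nothing to make pickle safe if the key leaks.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b'replace-with-a-real-shared-secret'   # hypothetical key
DIGEST_SIZE = hashlib.sha256().digest_size          # 32 bytes

def dumps_signed(obj, key=SECRET_KEY):
    """Pickle obj and prepend an HMAC-SHA256 signature."""
    payload = pickle.dumps(obj)
    digest = hmac.new(key, payload, hashlib.sha256).digest()
    return digest + payload

def loads_signed(blob, key=SECRET_KEY):
    """Verify the signature first; unpickle only if it matches."""
    digest, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(digest, expected):
        raise ValueError('signature mismatch; refusing to unpickle')
    return pickle.loads(payload)

blob = dumps_signed({'x': 1, 'y': [2, 3]})
restored = loads_signed(blob)
```

The important detail is the ordering: pickle.loads() is never called on data whose signature has not been checked.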
Certain kinds of objects can’t be pickled. These are typically
objects that involve some sort of external system state, such as open
files, open network connections, threads, processes, stack frames, and
so forth. User-defined classes can sometimes work around these
limitations by providing __getstate__() and __setstate__()
methods. If defined, pickle.dump() will call __getstate__() to
get an object that can be pickled. Similarly, __setstate__() will
be invoked on unpickling. To illustrate what’s
possible, here is a class that internally defines a thread but can
still be pickled/unpickled:
# countdown.py
import time
import threading

class Countdown:
    def __init__(self, n):
        self.n = n
        self.thr = threading.Thread(target=self.run)
        self.thr.daemon = True
        self.thr.start()

    def run(self):
        while self.n > 0:
            print('T-minus', self.n)
            self.n -= 1
            time.sleep(5)

    def __getstate__(self):
        return self.n

    def __setstate__(self, n):
        self.__init__(n)
Try the following experiment involving pickling:
>>> import countdown
>>> c = countdown.Countdown(30)
>>> T-minus 30
T-minus 29
T-minus 28
...

>>> # After a few moments
>>> f = open('cstate.p', 'wb')
>>> import pickle
>>> pickle.dump(c, f)
>>> f.close()
Now quit Python and try this after restart:
>>> f = open('cstate.p', 'rb')
>>> pickle.load(f)
<countdown.Countdown object at 0x10069e2d0>
T-minus 19
T-minus 18
...
You should see the thread magically spring to life again, picking up where it left off when you first pickled it.
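A more common variation of the same idea, shown here as a sketch, is to drop an unpicklable attribute from the saved state and re-create it on loading. The Counter class is hypothetical; it holds a threading.Lock, which cannot be pickled:

```python
import pickle
import threading

class Counter:
    def __init__(self, n=0):
        self.n = n
        self.lock = threading.Lock()   # locks cannot be pickled

    def increment(self):
        with self.lock:
            self.n += 1

    def __getstate__(self):
        # Copy the instance dictionary and leave out the lock
        state = self.__dict__.copy()
        del state['lock']
        return state

    def __setstate__(self, state):
        # Restore the saved attributes and make a fresh lock
        self.__dict__.update(state)
        self.lock = threading.Lock()

c = Counter(5)
c.increment()
clone = pickle.loads(pickle.dumps(c))
```

Unlike the Countdown example, this version doesn’t rerun __init__(); it restores the saved attributes directly and only rebuilds the piece that couldn’t be saved.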
pickle is not a particularly efficient encoding for large data
structures such as binary arrays created by libraries like the
array module or numpy. If you’re moving large amounts of array
data around, you may be better off simply saving bulk array data in a
file or using a more standardized encoding, such as HDF5 (supported by
third-party libraries).
Because of its Python-specific nature and attachment to
source code, you probably shouldn’t use pickle as a format for
long-term storage. For example, if the source code changes, all of
your stored data might break and become unreadable. Frankly, for
storing data in databases and archival storage, you’re probably better off
using a more standard data encoding, such as XML, CSV, or JSON.
These encodings are more standardized, supported by many different
languages, and more likely to be better adapted to changes in your
source code.
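For simple data made of dictionaries, lists, strings, and numbers, the json module is often a drop-in alternative. Here is a minimal sketch; the record shown is made-up sample data:

```python
import json

# Made-up sample record using only JSON-compatible types
record = {'name': 'ACME', 'shares': 50, 'price': 490.1, 'tags': ['a', 'b']}

text = json.dumps(record)      # a plain str that other languages can read
restored = json.loads(text)
```

Keep in mind that JSON supports fewer types than pickle. For example, tuples come back as lists, and sets can’t be encoded without extra work.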
Last, but not least, be aware that pickle has a huge variety of
options and tricky corner cases. For the most common uses, you don’t
need to worry about them, but a look at the official documentation should
be required if you’re going to build a significant application that
uses pickle for serialization.