Chapter 5. Files and I/O

All programs need to perform input and output. This chapter covers common idioms for working with different kinds of files, including text and binary files, file encodings, and other related matters. Techniques for manipulating filenames and directories are also covered.

Use the open() function with mode rt to read a text file. For example:

# Read the entire file as a single string
with open('somefile.txt', 'rt') as f:
    data = f.read()

# Iterate over the lines of the file
with open('somefile.txt', 'rt') as f:
    for line in f:
        # process line
        ...

Similarly, to write a text file, use open() with mode wt to write a file, clearing and overwriting the previous contents (if any). For example:

# Write chunks of text data
with open('somefile.txt', 'wt') as f:
    f.write(text1)
    f.write(text2)
    ...

# Redirected print statement
with open('somefile.txt', 'wt') as f:
    print(line1, file=f)
    print(line2, file=f)
    ...

To append to the end of an existing file, use open() with mode at.

By default, files are read/written using the system default text encoding, as can be found in sys.getdefaultencoding(). On most machines, this is set to utf-8. If you know that the text you are reading or writing is in a different encoding, supply the optional encoding parameter to open(). For example:

with open('somefile.txt', 'rt', encoding='latin-1') as f:
     ...

Python understands several hundred possible text encodings. However, some of the more common encodings are ascii, latin-1, utf-8, and utf-16. UTF-8 is usually a safe bet if working with web applications. ascii corresponds to the 7-bit characters in the range U+0000 to U+007F. latin-1 is a direct mapping of bytes 0-255 to Unicode characters U+0000 to U+00FF. latin-1 encoding is notable in that it will never produce a decoding error when reading text of a possibly unknown encoding. Reading a file as latin-1 might not produce a completely correct text decoding, but it still might be enough to extract useful data out of it. Also, if you later write the data back out, the original input data will be preserved.

Reading and writing text files is typically very straightforward. However, there are a number of subtle aspects to keep in mind. First, the use of the with statement in the examples establishes a context in which the file will be used. When control leaves the with block, the file will be closed automatically. You don’t need to use the with statement, but if you don’t use it, make sure you remember to close the file:

f = open('somefile.txt', 'rt')
data = f.read()
f.close()

Another minor complication concerns the recognition of newlines, which are different on Unix and Windows (i.e., \n versus \r\n). By default, Python operates in what’s known as “universal newline” mode. In this mode, all common newline conventions are recognized, and newline characters are converted to a single \n character while reading. Similarly, the newline character \n is converted to the system default newline character on output. If you don’t want this translation, supply the newline='' argument to open(), like this:

# Read with disabled newline translation
with open('somefile.txt', 'rt', newline='') as f:
     ...

To illustrate the difference, here’s what you will see on a Unix machine if you read the contents of a Windows-encoded text file containing the raw data hello world!\r\n:

>>> # Newline translation enabled (the default)
>>> f = open('hello.txt', 'rt')
>>> f.read()
'hello world!\n'

>>> # Newline translation disabled
>>> g = open('hello.txt', 'rt', newline='')
>>> g.read()
'hello world!\r\n'
>>>

A final issue concerns possible encoding errors in text files. When reading or writing a text file, you might encounter an encoding or decoding error. For instance:

>>> f = open('sample.txt', 'rt', encoding='ascii')
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.3/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
12: ordinal not in range(128)
>>>

If you get this error, it usually means that you’re not reading the file in the correct encoding. You should carefully read the specification of whatever it is that you’re reading and check that you’re doing it right (e.g., reading data as UTF-8 instead of Latin-1 or whatever it needs to be). If encoding errors are still a possibility, you can supply an optional errors argument to open() to deal with the errors. Here are a few samples of common error handling schemes:

>>> # Replace bad chars with Unicode U+fffd replacement char
>>> f = open('sample.txt', 'rt', encoding='ascii', errors='replace')
>>> f.read()
'Spicy Jalape?o!'
>>> # Ignore bad chars entirely
>>> g = open('sample.txt', 'rt', encoding='ascii', errors='ignore')
>>> g.read()
'Spicy Jalapeo!'
>>>

If you’re constantly fiddling with the encoding and errors arguments to open() and doing lots of hacks, you’re probably making life more difficult than it needs to be. The number one rule with text is that you simply need to make sure you’re always using the proper text encoding. When in doubt, use the default setting (typically UTF-8).

When reading binary data, the subtle semantic differences between byte strings and text strings pose a potential gotcha. In particular, be aware that indexing and iteration return integer byte values instead of byte strings. For example:

>>> # Text string
>>> t = 'Hello World'
>>> t[0]
'H'
>>> for c in t:
...     print(c)
...
H
e
l
l
o
...
>>> # Byte string
>>> b = b'Hello World'
>>> b[0]
72
>>> for c in b:
...     print(c)
...
72
101
108
108
111
...
>>>

If you ever need to read or write text from a binary-mode file, make sure you remember to decode or encode it. For example:

with open('somefile.bin', 'rb') as f:
    data = f.read(16)
    text = data.decode('utf-8')

with open('somefile.bin', 'wb') as f:
    text = 'Hello World'
    f.write(text.encode('utf-8'))

A lesser-known aspect of binary I/O is that objects such as arrays and C structures can be used for writing without any kind of intermediate conversion to a bytes object. For example:

import array
nums = array.array('i', [1, 2, 3, 4])
with open('data.bin','wb') as f:
    f.write(nums)

This applies to any object that implements the so-called “buffer interface,” which directly exposes an underlying memory buffer to operations that can work with it. Writing binary data is one such operation.

Many objects also allow binary data to be directly read into their underlying memory using the readinto() method of files. For example:

>>> import array
>>> a = array.array('i', [0, 0, 0, 0, 0, 0, 0, 0])
>>> with open('data.bin', 'rb') as f:
...     f.readinto(a)
...
16
>>> a
array('i', [1, 2, 3, 4, 0, 0, 0, 0])
>>>

However, great care should be taken when using this technique, as it is often platform specific and may depend on such things as the word size and byte ordering (i.e., big endian versus little endian). See Recipe 5.9 for another example of reading binary data into a mutable buffer.

The readinto() method of files can be used to fill any preallocated array with data. This even includes arrays created from the array module or libraries such as numpy. Unlike the normal read() method, readinto() fills the contents of an existing buffer rather than allocating new objects and returning them. Thus, you might be able to use it to avoid making extra memory allocations. For example, if you are reading a binary file consisting of equally sized records, you can write code like this:

record_size = 32           # Size of each record (adjust value)

buf = bytearray(record_size)
with open('somefile', 'rb') as f:
    while True:
        n = f.readinto(buf)
        if n < record_size:
            break
        # Use the contents of buf
        ...

Another interesting feature to use here might be a memoryview, which lets you make zero-copy slices of an existing buffer and even change its contents. For example:

>>> buf
bytearray(b'Hello World')
>>> m1 = memoryview(buf)
>>> m2 = m1[-5:]
>>> m2
<memory at 0x100681390>
>>> m2[:] = b'WORLD'
>>> buf
bytearray(b'Hello WORLD')
>>>

One caution with using f.readinto() is that you must always make sure to check its return code, which is the number of bytes actually read.

If the number of bytes is smaller than the size of the supplied buffer, it might indicate truncated or corrupted data (e.g., if you were expecting an exact number of bytes to be read).

Finally, be on the lookout for other “into” related functions in various library modules (e.g., recv_into(), pack_into(), etc.). Many other parts of Python have support for direct I/O or data access that can be used to fill or alter the contents of arrays and buffers.

See Recipe 6.12 for a significantly more advanced example of interpreting binary structures and usage of memoryviews.

Use the mmap module to memory map files. Here is a utility function that shows how to open a file and memory map it in a portable manner:

import os
import mmap

def memory_map(filename, access=mmap.ACCESS_WRITE):
    size = os.path.getsize(filename)
    fd = os.open(filename, os.O_RDWR)
    return mmap.mmap(fd, size, access=access)

To use this function, you would need to have a file already created and filled with data. Here is an example of how you could initially create a file and expand it to a desired size:

>>> size = 1000000
>>> with open('data', 'wb') as f:
...      f.seek(size-1)
...      f.write(b'\x00')
...
>>>

Now here is an example of memory mapping the contents using the memory_map() function:

>>> m = memory_map('data')
>>> len(m)
1000000
>>> m[0:10]
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> m[0]
0
>>> # Reassign a slice
>>> m[0:11] = b'Hello World'
>>> m.close()

>>> # Verify that changes were made
>>> with open('data', 'rb') as f:
...      print(f.read(11))
...
b'Hello World'
>>>

The mmap object returned by mmap() can also be used as a context manager, in which case the underlying file is closed automatically. For example:

>>> with memory_map('data') as m:
...      print(len(m))
...      print(m[0:10])
...
1000000
b'Hello World'
>>> m.closed
True
>>>

By default, the memory_map() function shown opens a file for both reading and writing. Any modifications made to the data are copied back to the original file. If read-only access is needed instead, supply mmap.ACCESS_READ for the access argument. For example:

m = memory_map(filename, mmap.ACCESS_READ)

If you intend to modify the data locally, but don’t want those changes written back to the original file, use mmap.ACCESS_COPY:

m = memory_map(filename, mmap.ACCESS_COPY)

Using mmap to map files into memory can be an efficient and elegant means for randomly accessing the contents of a file. For example, instead of opening a file and performing various combinations of seek(), read(), and write() calls, you can simply map the file and access the data using slicing operations.

Normally, the memory exposed by mmap() looks like a bytearray object. However, you can interpret the data differently using a memoryview. For example:

>>> m = memory_map('data')
>>> # Memoryview of unsigned integers
>>> v = memoryview(m).cast('I')
>>> v[0] = 7
>>> m[0:4]
b'\x07\x00\x00\x00'
>>> m[0:4] = b'\x07\x01\x00\x00'
>>> v[0]
263
>>>

It should be emphasized that memory mapping a file does not cause the entire file to be read into memory. That is, it’s not copied into some kind of memory buffer or array. Instead, the operating system merely reserves a section of virtual memory for the file contents. As you access different regions, those portions of the file will be read and mapped into the memory region as needed. However, parts of the file that are never accessed simply stay on disk. This all happens transparently, behind the scenes.

If more than one Python interpreter memory maps the same file, the resulting mmap object can be used to exchange data between interpreters. That is, all interpreters can read/write data simultaneously, and changes made to the data in one interpreter will automatically appear in the others. Obviously, some extra care is required to synchronize things, but this kind of approach is sometimes used as an alternative to transmitting data in messages over pipes or sockets.

As shown, this recipe has been written to be as general purpose as possible, working on both Unix and Windows. Be aware that there are some platform differences concerning the use of the mmap() call hidden behind the scenes. In addition, there are options to create anonymously mapped memory regions. If this is of interest to you, make sure you carefully read the Python documentation on the subject.

This recipe is about a potentially rare but very annoying problem regarding programs that must manipulate the filesystem. By default, Python assumes that all filenames are encoded according to the setting reported by sys.getfilesystemencoding(). However, certain filesystems don’t necessarily enforce this encoding restriction, thereby allowing files to be created without proper filename encoding. It’s not common, but there is always the danger that some user will do something silly and create such a file by accident (e.g., maybe passing a bad filename to open() in some buggy code).

When executing a command such as os.listdir(), bad filenames leave Python in a bind. On the one hand, it can’t just discard bad names. On the other hand, it still can’t turn the filename into a proper text string. Python’s solution to this problem is to take an undecodable byte value \xhh in a filename and map it into a so-called “surrogate encoding” represented by the Unicode character \udchh. Here is an example of how a bad directory listing might look if it contained a filename bäd.txt, encoded as Latin-1 instead of UTF-8:

>>> import os
>>> files = os.listdir('.')
>>> files
['spam.py', 'b\udce4d.txt', 'foo.txt']
>>>

If you have code that manipulates filenames or even passes them to functions such as open(), everything works normally. It’s only in situations where you want to output the filename that you run into trouble (e.g., printing it to the screen, logging it, etc.). Specifically, if you tried to print the preceding listing, your program will crash:

>>> for name in files:
...     print(name)
...
spam.py
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce4' in
position 1: surrogates not allowed
>>>

The reason it crashes is that the character \udce4 is technically invalid Unicode. It’s actually the second half of a two-character combination known as a surrogate pair. However, since the first half is missing, it’s invalid Unicode. Thus, the only way to produce successful output is to take corrective action when a bad filename is encountered. For example, changing the code to the recipe produces the following:

>>> for name in files:
...     try:
...             print(name)
...     except UnicodeEncodeError:
...             print(bad_filename(name))
...
spam.py
b\udce4d.txt
foo.txt
>>>

The choice of what to do for the bad_filename() function is largely up to you. Another option is to re-encode the value in some way, like this:

def bad_filename(filename):
    temp = filename.encode(sys.getfilesystemencoding(), errors='surrogateescape')
    return temp.decode('latin-1')

Using this version produces the following output:

>>> for name in files:
...     try:
...             print(name)
...     except UnicodeEncodeError:
...             print(bad_filename(name))
...
spam.py
bäd.txt
foo.txt
>>>

This recipe will likely be ignored by most readers. However, if you’re writing mission-critical scripts that need to work reliably with filenames and the filesystem, it’s something to think about. Otherwise, you might find yourself called back into the office over the weekend to debug a seemingly inscrutable error.

The I/O system is built as a series of layers. You can see the layers yourself by trying this simple example involving a text file:

>>> f = open('sample.txt','w')
>>> f
<_io.TextIOWrapper name='sample.txt' mode='w' encoding='UTF-8'>
>>> f.buffer
<_io.BufferedWriter name='sample.txt'>
>>> f.buffer.raw
<_io.FileIO name='sample.txt' mode='wb'>
>>>

In this example, io.TextIOWrapper is a text-handling layer that encodes and decodes Unicode, io.BufferedWriter is a buffered I/O layer that handles binary data, and io.FileIO is a raw file representing the low-level file descriptor in the operating system. Adding or changing the text encoding involves adding or changing the topmost io.TextIOWrapper layer.

As a general rule, it’s not safe to directly manipulate the different layers by accessing the attributes shown. For example, see what happens if you try to change the encoding using this technique:

>>> f
<_io.TextIOWrapper name='sample.txt' mode='w' encoding='UTF-8'>
>>> f = io.TextIOWrapper(f.buffer, encoding='latin-1')
>>> f
<_io.TextIOWrapper name='sample.txt' encoding='latin-1'>
>>> f.write('Hello')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: I/O operation on closed file.
>>>

It doesn’t work because the original value of f got destroyed and closed the underlying file in the process.

The detach() method disconnects the topmost layer of a file and returns the next lower layer. Afterward, the top layer will no longer be usable. For example:

>>> f = open('sample.txt', 'w')
>>> f
<_io.TextIOWrapper name='sample.txt' mode='w' encoding='UTF-8'>
>>> b = f.detach()
>>> b
<_io.BufferedWriter name='sample.txt'>
>>> f.write('hello')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: underlying buffer has been detached
>>>

Once detached, however, you can add a new top layer to the returned result. For example:

>>> f = io.TextIOWrapper(b, encoding='latin-1')
>>> f
<_io.TextIOWrapper name='sample.txt' encoding='latin-1'>
>>>

Although changing the encoding has been shown, it is also possible to use this technique to change the line handling, error policy, and other aspects of file handling. For example:

>>> sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding='ascii',
...                               errors='xmlcharrefreplace')
>>> print('Jalape\u00f1o')
Jalape&#241;o
>>>

Notice how the non-ASCII character ñ has been replaced by &#241; in the output.

On Unix systems, this technique of wrapping a file descriptor can be a convenient means for putting a file-like interface on an existing I/O channel that was opened in a different way (e.g., pipes, sockets, etc.). For instance, here is an example involving sockets:

from socket import socket, AF_INET, SOCK_STREAM

def echo_client(client_sock, addr):
    print('Got connection from', addr)

    # Make text-mode file wrappers for socket reading/writing
    client_in = open(client_sock.fileno(), 'rt', encoding='latin-1',
                         closefd=False)
    client_out = open(client_sock.fileno(), 'wt', encoding='latin-1',
                          closefd=False)

    # Echo lines back to the client using file I/O
    for line in client_in:
        client_out.write(line)
        client_out.flush()
    client_sock.close()

def echo_server(address):
    sock = socket(AF_INET, SOCK_STREAM)
    sock.bind(address)
    sock.listen(1)
    while True:
        client, addr = sock.accept()
        echo_client(client, addr)

It’s important to emphasize that the above example is only meant to illustrate a feature of the built-in open() function and that it only works on Unix-based systems. If you are trying to put a file-like interface on a socket and need your code to be cross platform, use the makefile() method of sockets instead. However, if portability is not a concern, you’ll find that the above solution provides much better performance than using makefile().

You can also use this to make a kind of alias that allows an already open file to be used in a slightly different way than how it was first opened. For example, here’s how you could create a file object that allows you to emit binary data on stdout (which is normally opened in text mode):

import sys
# Create a binary-mode file for stdout
bstdout = open(sys.stdout.fileno(), 'wb', closefd=False)
bstdout.write(b'Hello World\n')
bstdout.flush()

Although it’s possible to wrap an existing file descriptor as a proper file, be aware that not all file modes may be supported and that certain kinds of file descriptors may have funny side effects (especially with respect to error handling, end-of-file conditions, etc.). The behavior can also vary according to operating system. In particular, none of the examples are likely to work on non-Unix systems. The bottom line is that you’ll need to thoroughly test your implementation to make sure it works as expected.

The tempfile module has a variety of functions for performing this task. To make an unnamed temporary file, use tempfile.TemporaryFile:

from tempfile import TemporaryFile

with TemporaryFile('w+t') as f:
     # Read/write to the file
     f.write('Hello World\n')
     f.write('Testing\n')

     # Seek back to beginning and read the data
     f.seek(0)
     data = f.read()

# Temporary file is destroyed

Or, if you prefer, you can also use the file like this:

f = TemporaryFile('w+t')
# Use the temporary file
...
f.close()
# File is destroyed

The first argument to TemporaryFile() is the file mode, which is usually w+t for text and w+b for binary. This mode simultaneously supports reading and writing, which is useful here since closing the file to change modes would actually destroy it. TemporaryFile() additionally accepts the same arguments as the built-in open() function. For example:

with TemporaryFile('w+t', encoding='utf-8', errors='ignore') as f:
     ...

On most Unix systems, the file created by TemporaryFile() is unnamed and won’t even have a directory entry. If you want to relax this constraint, use NamedTemporaryFile() instead. For example:

from tempfile import NamedTemporaryFile

with NamedTemporaryFile('w+t') as f:
    print('filename is:', f.name)
    ...

# File automatically destroyed

Here, the f.name attribute of the opened file contains the filename of the temporary file. This can be useful if it needs to be given to some other code that needs to open the file. As with TemporaryFile(), the resulting file is automatically deleted when it’s closed. If you don’t want this, supply a delete=False keyword argument. For example:

with NamedTemporaryFile('w+t', delete=False) as f:
    print('filename is:', f.name)
    ...

To make a temporary directory, use tempfile.TemporaryDirectory(). For example:

from tempfile import TemporaryDirectory
with TemporaryDirectory() as dirname:
     print('dirname is:', dirname)
     # Use the directory
     ...
# Directory and all contents destroyed

The TemporaryFile(), NamedTemporaryFile(), and TemporaryDirectory() functions are probably the most convenient way to work with temporary files and directories, because they automatically handle all of the steps of creation and subsequent cleanup. At a lower level, you can also use the mkstemp() and mkdtemp() to create temporary files and directories. For example:

>>> import tempfile
>>> tempfile.mkstemp()
(3, '/var/folders/7W/7WZl5sfZEF0pljrEB1UMWE+++TI/-Tmp-/tmp7fefhv')
>>> tempfile.mkdtemp()
'/var/folders/7W/7WZl5sfZEF0pljrEB1UMWE+++TI/-Tmp-/tmp5wvcv6'
>>>

However, these functions don’t really take care of further management. For example, the mkstemp() function simply returns a raw OS file descriptor and leaves it up to you to turn it into a proper file. Similarly, it’s up to you to clean up the files if you want.

Normally, temporary files are created in the system’s default location, such as /var/tmp or similar. To find out the actual location, use the tempfile.gettempdir() function. For example:

>>> tempfile.gettempdir()
'/var/folders/7W/7WZl5sfZEF0pljrEB1UMWE+++TI/-Tmp-'
>>>

All of the temporary-file-related functions allow you to override this directory as well as the naming conventions using the prefix, suffix, and dir keyword arguments. For example:

>>> f = NamedTemporaryFile(prefix='mytemp', suffix='.txt', dir='/tmp')
>>> f.name
'/tmp/mytemp8ee899.txt'
>>>

Last, but not least, to the extent possible, the tempfile() module creates temporary files in the most secure manner possible. This includes only giving access permission to the current user and taking steps to avoid race conditions in file creation. Be aware that there can be differences between platforms. Thus, you should make sure to check the official documentation for the finer points.

Although you can probably do this directly using Python’s built-in I/O primitives, your best bet for serial communication is to use the pySerial package. Getting started with the package is very easy. You simply open up a serial port using code like this:

import serial
ser = serial.Serial('/dev/tty.usbmodem641',  # Device name varies
                     baudrate=9600,
                     bytesize=8,
                     parity='N',
                     stopbits=1)

The device name will vary according to the kind of device and operating system. For instance, on Windows, you can use a device of 0, 1, and so on, to open up the communication ports such as “COM0” and “COM1.” Once open, you can read and write data using read(), readline(), and write() calls. For example:

ser.write(b'G1 X50 Y50\r\n')
resp = ser.readline()

For the most part, simple serial communication should be pretty simple from this point forward.

Discussion

Although simple on the surface, serial communication can sometimes get rather messy. One reason you should use a package such as pySerial is that it provides support for advanced features (e.g., timeouts, control flow, buffer flushing, handshaking, etc.). For instance, if you want to enable RTS-CTS handshaking, you simply provide a rtscts=True argument to Serial(). The provided documentation is excellent, so there’s little benefit to paraphrasing it here.

Keep in mind that all I/O involving serial ports is binary. Thus, make sure you write your code to use bytes instead of text (or perform proper text encoding/decoding as needed). The struct module may also be useful should you need to create binary-coded commands or packets.

For most programs, usage of the dump() and load() function is all you need to effectively use pickle. It simply works with most Python data types and instances of user-defined classes. If you’re working with any kind of library that lets you do things such as save/restore Python objects in databases or transmit objects over the network, there’s a pretty good chance that pickle is being used.

pickle is a Python-specific self-describing data encoding. By self-describing, the serialized data contains information related to the start and end of each object as well as information about its type. Thus, you don’t need to worry about defining records—it simply works. For example, if working with multiple objects, you can do this:

>>> import pickle
>>> f = open('somedata', 'wb')
>>> pickle.dump([1, 2, 3, 4], f)
>>> pickle.dump('hello', f)
>>> pickle.dump({'Apple', 'Pear', 'Banana'}, f)
>>> f.close()
>>> f = open('somedata', 'rb')
>>> pickle.load(f)
[1, 2, 3, 4]
>>> pickle.load(f)
'hello'
>>> pickle.load(f)
{'Apple', 'Pear', 'Banana'}
>>>

You can pickle functions, classes, and instances, but the resulting data only encodes name references to the associated code objects. For example:

>>> import math
>>> import pickle
>>> pickle.dumps(math.cos)
b'\x80\x03cmath\ncos\nq\x00.'
>>>

When the data is unpickled, it is assumed that all of the required source is available. Modules, classes, and functions will automatically be imported as needed. For applications where Python data is being shared between interpreters on different machines, this is a potential maintenance issue, as all machines must have access to the same source code.

Certain kinds of objects can’t be pickled. These are typically objects that involve some sort of external system state, such as open files, open network connections, threads, processes, stack frames, and so forth. User-defined classes can sometimes work around these limitations by providing __getstate__() and __setstate__() methods. If defined, pickle.dump() will call __getstate__() to get an object that can be pickled. Similarly, __setstate__() will be invoked on unpickling. To illustrate what’s possible, here is a class that internally defines a thread but can still be pickled/unpickled:

# countdown.py
import time
import threading

class Countdown:
    def __init__(self, n):
        self.n = n
        self.thr = threading.Thread(target=self.run)
        self.thr.daemon = True
        self.thr.start()

    def run(self):
        while self.n > 0:
            print('T-minus', self.n)
            self.n -= 1
            time.sleep(5)

    def __getstate__(self):
        return self.n

    def __setstate__(self, n):
        self.__init__(n)

Try the following experiment involving pickling:

>>> import countdown
>>> c = countdown.Countdown(30)
>>> T-minus 30
T-minus 29
T-minus 28
...

>>> # After a few moments
>>> f = open('cstate.p', 'wb')
>>> import pickle
>>> pickle.dump(c, f)
>>> f.close()

Now quit Python and try this after restart:

>>> f = open('cstate.p', 'rb')
>>> pickle.load(f)
countdown.Countdown object at 0x10069e2d0>
T-minus 19
T-minus 18
...

You should see the thread magically spring to life again, picking up where it left off when you first pickled it.

pickle is not a particularly efficient encoding for large data structures such as binary arrays created by libraries like the array module or numpy. If you’re moving large amounts of array data around, you may be better off simply saving bulk array data in a file or using a more standardized encoding, such as HDF5 (supported by third-party libraries).

Because of its Python-specific nature and attachment to source code, you probably shouldn’t use pickle as a format for long-term storage. For example, if the source code changes, all of your stored data might break and become unreadable. Frankly, for storing data in databases and archival storage, you’re probably better off using a more standard data encoding, such as XML, CSV, or JSON. These encodings are more standardized, supported by many different languages, and more likely to be better adapted to changes in your source code.

Last, but not least, be aware that pickle has a huge variety of options and tricky corner cases. For the most common uses, you don’t need to worry about them, but a look at the official documentation should be required if you’re going to build a signficant application that uses pickle for serialization.