For anything more than a small program, you will want to organize your code into modules. This is also the unit of organization for reusable libraries—both libraries you create, and outside code you import into your own codebase.
Python’s module system is a delight: easy to use, well designed, and extremely flexible. Making full use of it requires understanding its import mechanisms, and how they interact with namespacing. We will dive deep into all of that in this chapter.
In particular, we will focus on how modules evolve. Requirements change over time: new requirements come in, you gain clarity on existing ones, and you learn better ways to organize your codebase. So we will cover best practices for refactoring and updating module structure during the development process.
This is an important and practical part of working with modules. But for some reason, it is never talked about or written about. Until now.
To touch on everything important about modules, we will follow the lifecycle of a small Python script that gradually grows in scope and size, eventually reaching a point where we want to package its components into reusable modules. We do this not just for sensible organization, but also so we can import them into other applications. This evolution from script to spin-off library happens all the time in real software development.
Imagine you create a little Python script, called
findpattern.py. Its job is simple: to scan through a text file, and
print out those lines which contain a certain substring.
This is similar to a program called grep in the Linux (and Unix)
world, which operates on a stream of lines of text, filtering for
those which match a pattern you specify. findpattern.py is less
powerful than grep. But it is more portable, and—more importantly—lets us illustrate everything we need in this chapter.
# findpattern.py
import sys

def grepfile(pattern, path):
    with open(path) as handle:
        for line in handle:
            if pattern in line:
                yield line.rstrip('\n')

pattern, path = sys.argv[1], sys.argv[2]
for line in grepfile(pattern, path):
    print(line)
This program defines a generator function called grepfile(), which
may remind you of the matching_lines_from_file() function in
Chapter 1—because it demonstrates many of the same Python best
practices. It takes two arguments: pattern, the substring pattern
to search for, and path, the path to the file to search in.
Imagine you have a file named log.txt, containing simple log
messages, one per line, like this:
ERROR: Out of milk
WARNING: Running low on orange juice
INFO: Temperature adjusted to 39 degrees
ERROR: Alien spacecraft crashed
Your program needs to get these two values from the command line. So you invoke the program like this:
% python findpattern.py ERROR log.txt
It is best to extract the command-line arguments with a dedicated
library for parsing them, such as Python’s built-in argparse
module. But to avoid a side quest explaining how that works, we will
simply extract them from argv in the sys module.
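For comparison, here is a hedged sketch of what the argument handling could look like with argparse; the help strings are illustrative, not taken from this chapter's code:

```python
import argparse

# A sketch of findpattern-style argument parsing with argparse.
parser = argparse.ArgumentParser(
    description="Print lines of a text file containing a substring.")
parser.add_argument("pattern", help="substring to search for")
parser.add_argument("path", help="path of the text file to search")

# Normally you call parser.parse_args() with no arguments, so it reads
# sys.argv; here we pass a list explicitly to demonstrate.
args = parser.parse_args(["ERROR", "log.txt"])
print(args.pattern)  # ERROR
print(args.path)     # log.txt
```

With argparse you also get --help output and usage errors for free, which is why it is the better choice in real programs.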
sys.argv is a list of strings. Its first element, at index 0, is the
name of the program you are running; so its value will be "findpattern.py" in this case. What we want are the command-line
arguments, ERROR and log.txt. These will be at indices 1 and 2 in
sys.argv, respectively. So we load those into the pattern and
path variables. With that out of the way, we can simply iterate through
the generator object returned by grepfile(), printing out each
matching line.
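To make the indexing concrete, here is what sys.argv holds for the invocation above, simulated with a plain list:

```python
# Simulating sys.argv for: python findpattern.py ERROR log.txt
argv = ["findpattern.py", "ERROR", "log.txt"]

# Index 0 is the program name; the real arguments start at index 1.
pattern, path = argv[1], argv[2]
print(pattern)  # ERROR
print(path)     # log.txt
```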
Whenever you create a Python file, that also creates a module of the
same name (minus the .py extension). So our findpattern.py file,
in addition to being a complete program, also creates a module named
findpattern. This happens automatically; you do not have to declare
this, or take any extra steps.
Whenever you have a module, that means you can import from it. In
particular, we have a nice function named grepfile(), which is
general enough that we may want to reuse it in other programs. Let’s
do that.
You decide to create a program called finderrors.py to show only
the ERROR lines—similar to findpattern.py, but more
specialized.
grepfile() is a good tool for implementing this, so you decide to
import it. Your first version looks like this:
# finderrors.py
import sys
from findpattern import grepfile

path = sys.argv[1]
for line in grepfile('ERROR:', path):
    print(line)
This looks straightforward. But when you run the program, you get a strange error:
$ python finderrors.py log.txt
Traceback (most recent call last):
File "finderrors.py", line 3, in <module>
from findpattern import grepfile
File "findpattern.py", line 10, in <module>
pattern, path = sys.argv[1], sys.argv[2]
IndexError: list index out of range
IndexError? What on earth would cause that?
Let’s step through the stack trace. You can see that the error originates
in line 3 of finderrors.py. That is interesting, because it is the
import line—where grepfile() is imported from the findpattern
module.
Next, the stack trace descends into the findpattern.py
file. Specifically, on line 10, where it unpacks variables from
sys.argv.
Now you understand the problem. findpattern.py was written to be used
as a standalone program. But we are creating a new program and want
to reuse code. In the process of importing, sys.argv is
unpacked, expecting two arguments, instead of the one argument which
finderrors.py actually takes.
When you import from a module, Python creates that module by executing
its code. It executes the def statements to create the functions,
and executes the class statements to create the classes. But in the
process of creating the module, Python executes the entire file—including that sys.argv line.
So what do we do? The code is correct for running
findpattern.py. But it doesn’t allow finderrors.py to run. How do
we import from findpattern and allow both programs to run?
The solution is to use a main guard. It looks like this:
# Replace the final 3 lines of findpattern.py with this:
if __name__ == "__main__":
    pattern, path = sys.argv[1], sys.argv[2]
    for line in grepfile(pattern, path):
        print(line)
We have taken the last three lines and indented them inside an “if”
block. The “if” condition references a magic variable called
__name__. This variable is automatically and globally defined, and
will always have a string value. But to understand the nature of that
value, we must first understand something about how Python programs
are executed.
When you execute a Python program, often the code which makes up the
program is spread across more than one file. That is what happens
here, right? When you run finderrors.py, some of this program’s code
is in that file. But some of the code is in the findpattern.py
file. So the Python code for this program is distributed over two
different files.
But here’s the important point: whenever you run a Python program
consisting of several files, one of those will be the main
executable for the program. That is the file which will show in the
process table or task manager of the operating system. When you run
the finderrors.py program, then finderrors.py is the main
executable. That is what will show up in the process table, even
though it is also utilizing code in the findpattern.py file. Of
course, real Python programs often use dozens, hundreds, or even
thousands of distinct Python files, especially when your program uses
many libraries.
And so, back to __name__. This magic variable will have one of two
values:
If it is referenced inside the main executable file, its value will
be the string “__main__”.
If it is referenced in any other Python file, it will be the name of the module that file creates.
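You can verify this behavior in a self-contained way, by writing a tiny module to a temporary file and importing it; the module name whoami is made up for this demonstration:

```python
import importlib.util
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # A one-line module that records its own __name__.
    mod_path = pathlib.Path(tmp) / "whoami.py"
    mod_path.write_text("value_of_name = __name__\n")

    # Import it the long way, so this demo works from any directory.
    spec = importlib.util.spec_from_file_location("whoami", mod_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    # When imported, __name__ is the module's name, not "__main__".
    print(module.value_of_name)  # whoami
```

Run that same one-line file directly with python whoami.py, and it would instead see __name__ set to the string "__main__".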
And so we use this in findpattern.py, to make this file usable as a
stand-alone program, and as a module which can be imported from. By
checking the value of __name__, we effectively partition the file
into two parts: code which is always executed (and thus its objects
are importable), and code which is executed only when this file is
itself run as a program.
Here is the full source of our amended findpattern.py:
import sys

def grepfile(pattern, path):
    with open(path) as handle:
        for line in handle:
            if pattern in line:
                yield line.rstrip('\n')

if __name__ == "__main__":
    pattern, path = sys.argv[1], sys.argv[2]
    for line in grepfile(pattern, path):
        print(line)
Let’s step through this. Suppose you execute findpattern.py as the
program. Then the findpattern.py file is the main executable, so the
value of __name__ will be equal to “__main__”. This means the “if”
condition evaluates to True, and the final three lines are executed.
In contrast, imagine you run finderrors.py. This file imports from
the findpattern module, which means all lines in the
findpattern.py file are executed. But the value of __name__ is set
to the string “findpattern”. So the “if” condition evaluates to False,
and the final three lines are skipped. Now both programs work:
$ python findpattern.py ERROR log.txt
ERROR: Out of milk
ERROR: Alien spacecraft crashed
$ python finderrors.py log.txt
ERROR: Out of milk
ERROR: Alien spacecraft crashed
Time passes, and new requirements roll in—as they always do. You now
need a variant of the grepfile() function, which is case
insensitive. Let’s call it igrepfile().
In your team, suppose you are in charge of finderrors.py, and your
coworker is in charge of findpattern.py. You are importing the
grepfile() function from findpattern.py. But where are you going to put the new igrepfile() function?
You cannot put it in findpattern.py. Your coworker is politely
allowing you to reuse his code by importing from the findpattern
module, but he draws the line at adding new functions he does not care
about to his file. That is just creating more maintenance work for him.
At this point, it makes more sense to organize the code into a
different module. Up to now, findpattern.py has been filling two
completely different roles. It’s a program that does something useful,
and it is a container of sorts: holding components that can be
reused by other programs—the grepfile() function, in particular.
Now you are going to separate these roles. You create a new file,
called greputils.py. It is not meant to be an executable program.
Its sole purpose is to create a module, called greputils, holding
reusable code.
In particular, it will define the grepfile() function. You cut the
entire definition of grepfile() from findpattern.py, and paste it
into greputils.py. Then in both findpattern.py and
finderrors.py, you add this import statement:
from greputils import grepfile
Congratulations: you have created a reusable library for your team’s software.
This is a great step forward in code organization. You can stuff
whatever new classes and functions make sense in this greputils
module…without disturbing any other program which relies on it. A
clear separation of roles.
Notice something else: this has evolved naturally. The entire process described above is fully realistic, in terms of how a module of reusable code “emerges” during normal application development. This example shows how it happens in a team, but it happens much the same way when you are developing solo. As you develop new programs, you naturally find you want to reuse functions and classes from your previous projects, and it only makes sense to collect them all in a single convenient module.
The current versions of our files:
# greputils.py
def grepfile(pattern, path):
    with open(path) as handle:
        for line in handle:
            if pattern in line:
                yield line.rstrip('\n')
# findpattern.py
import sys
from greputils import grepfile

pattern, path = sys.argv[1], sys.argv[2]
for line in grepfile(pattern, path):
    print(line)
# finderrors.py
import sys
from greputils import grepfile

path = sys.argv[1]
for line in grepfile('ERROR:', path):
    print(line)
Continuing to the next phase of our module’s evolution, imagine you
keep adding new functions and classes over time. So the greputils.py
file gets bigger and bigger, with more and more lines of code.
To make it concrete: imagine you often have to check whether a text file contains a certain substring. You just want a yes or no answer to that question; you don’t need the set of lines that match, you just want to know whether at least one line matches, or none of them do.
This function should return True if the text file contains that
string, and False otherwise. You call this function
contains():
def contains(pattern, path):
    with open(path) as handle:
        for line in handle:
            if pattern in line:
                return True
    return False
You also create a case-insensitive version named icontains(), and so
on; you continue creating new functions and classes over time as you
and your teammates need them, always adding them to greputils.py.
Eventually, this file will just get too big. It is going to be awkward to work with at best, and encourage excessive intercoupling and code entanglement at worst. If there are multiple developers hacking the same codebase, it may even increase the frequency of merge conflicts. And code logically separated into different files is just easier to work with, as separate tabs in IDEs.
So we refactor the module. Make this distinction: the module is not
the file greputils.py. Rather, the module is greputils, which
happens to be implemented as a file named greputils.py. But there are
other ways to implement the same module, with the same contents and
the exact same interface for importing.
Specifically, we can implement the module as a collection of several files, organized in a particular way. How do you do that?
The first step is to create a directory, with the name of the module:
greputils. And this directory will contain several files.
The first is a file named __init__.py, just like the __init__
method of Python classes, except with .py at the end. This is where we
create the interface to the module, in terms of what you can import
from it directly.
But in order to do that, we first create another file in the
greputils folder, which we will call files.py. This is where we
will put the file-grepping functions: grepfile(), igrepfile(), and
any others we have created so far. We will also create a third file,
named contain.py, which is where we put the “does this file contain
this string” functions, like contains() and icontains(). Now, our
filesystem layout looks like this:
greputils/
greputils/__init__.py
greputils/files.py
greputils/contain.py
This solves our giant file problem. We can put as many files in this
greputils folder as we want. No matter how many classes and
functions we invent, we can split them up into reasonably sized
files, each defining a submodule within the main greputils module.
But we have one more step, which returns us to the __init__.py file.
Remember I said this file creates the import interface of the
module. Originally, when everything lived in a single greputils.py
file, you could write code like:
# used by findpattern.py
from greputils import grepfile
# used by a different program
from greputils import icontains
When you refactor a function, you want its interface and outward behavior to be unchanged. Your modified implementation should be invisible to anyone using that function. A close analogy holds for modules. When we “refactor” it to be a directory rather than a single file, we want these import statements to continue to work just as they did before.
When you implement a module as a single file, people can
automatically import anything in the module with a simple from
MODULENAME import ... statement. But when you implement a module as a
directory, that is not automatic. You must explicitly declare what
components will be directly importable in this way.
You do that in the __init__.py file. To see how it works, we must
understand a new concept, called submodules.
When you implement a module as a directory, the files in that
directory create submodules. These are accessed in a hierarchy under
the top-level module, using a dotted-name syntax. For example, if
grepfile() is in the greputils/files.py file, then you can import
it with a statement like:
from greputils.files import grepfile
But we do not want to break all our existing import statements. Even
if we did not mind the extra typing. The solution relies on the fact
that anything inside greputils/__init__.py will be directly
importable from the module itself. So all we have to do is import the
submodule contents into the __init__.py file. One way is to do it
like this:
# inside greputils/__init__.py
# (Spoiler: Don't do it like this, there is a better way).
from greputils.files import grepfile
That will actually work. But it is not as modular as it could be. What if you change the name of the module in the future? This creates a new place we need to change it, or miss it and create a bug. Or maybe we want to reorganize the module in some other way, which creates the same problem.
Instead, do a relative import. Inside the __init__.py file, write
this:
# inside greputils/__init__.py
from .files import grepfile
So you are importing from .files, not greputils.files. That
leading “.” is important here. This is what makes it a relative
import, and its syntax only works inside a module that is implemented
as a directory. When you use it inside the __init__.py file, it will
import objects from sub-modules relative to that directory, and make
them accessible inside the __init__.py file.
We do all this because anything in the __init__.py file is directly
importable from the module itself. In other words, because of these
relative imports, this statement works again:
from greputils import grepfile
We do this for all components for all the submodules. Our final
__init__.py file, including everything we have created so far, looks
like this:
# greputils/__init__.py
from .files import grepfile, igrepfile
from .contain import contains, icontains
Now we can import all these functions—grepfile(), igrepfile(),
contains(), and icontains()—directly:
# In one program...
from greputils import contains, grepfile
# And in another:
from greputils import icontains, igrepfile
You can be selective in what you import into the __init__.py file.
Your submodules may contain internal helper functions
and classes which you do not necessarily want other people importing,
using, and crafting their code to depend on. This is likely to be
common, in fact, as your modules grow beyond a certain size.
Handling this is easy: simply do not import them into the
__init__.py file. Then they are not directly importable from the
module. Someone can still import them from the dotted path of the
submodule if they peek into the source to find they are there, but few
people will do that. And if you change one of those internal
components later in a way that breaks their code, that is arguably
their problem and not yours.
When importing multiple components, you can do it all on one line,
with a single import statement. The greputils/__init__.py does
this, importing from its submodules:
# greputils/__init__.py
from .files import grepfile, igrepfile
from .contain import contains, icontains
If you are importing a couple of items, that works fine. But suppose you are importing more. Putting all those on one line creates problems, especially if you are working in a team.
Imagine you and I are developing the same codebase, in different
feature branches. In your branch, you add a new function named
grepfileregex(). You put this in greputils/files.py, and modify
greputils/__init__.py like this:
from .files import grepfile, igrepfile, grepfileregex
In my branch, I rename igrepfile() to grepfilei(). And edit the
__init__.py file like this:
from .files import grepfile, grepfilei
When we each finish our work and merge into main, suddenly we have a race condition. Whoever merges first will have no problem. But the one who merges last will get a merge conflict. Our current version-control tools do not know how to resolve this automatically; the loser of this race condition has to clean it up by hand. That is time-consuming at best, and risks creating new bugs at worst.
There is an easy solution. Python allows you to split imported components over several lines, using this syntax:
from .files import (
    grepfile,
    igrepfile,
)
from .contain import (
    contains,
    icontains,
)
Note the parentheses. This is much better, because our version control software can resolve the merge automatically, without the risk of introducing new bugs. As a general rule, if you are importing more than one item, your life will be happier if you split the import over several lines in this way.
Notice another detail here. Just looking at the first relative import:
from .files import (
    grepfile,
    igrepfile,
)
Do you see how there is a comma after “igrepfile”, even though it is
the last item in the sequence? Python does not require you to put a
comma there. But I recommend you do, because it pinpoints the diffs
even further.
Imagine you do not put that comma there, then add your new
grepfileregex() function. So you are changing it from this:
# no ending comma
from .files import (
    grepfile,
    igrepfile
)
…to this:
from .files import (
    grepfile,
    igrepfile,
    grepfileregex
)
You are adding a line for grepfileregex() but also modifying the
previous line—adding a comma after igrepfile. So the diff will
delete the old line without the comma, and add two lines. Like this:
-     igrepfile
+     igrepfile,
+     grepfileregex
In contrast, if you append a comma after every imported item, you are changing it from this:
# WITH an ending comma
from .files import (
    grepfile,
    igrepfile,
)
…to this:
from .files import (
    grepfile,
    igrepfile,
    grepfileregex,
)
This means your diff goes from three lines down to just one:
+     grepfileregex,
Like I said: more pinpointed diffs. Only good can come from that. Not only will you have fewer merge conflicts. If your team is doing code reviews, the reviewer will have to think less to decipher what you are changing. They are less likely to miss a bug that would otherwise be easy for them to catch.
This directory structure for modules is recursive. A module can be implemented as a file, or as a directory; and this also applies to submodules, sub-submodules, and so on.
Imagine you add a new submodule, called greputils.net, collecting
functions that check for contents of URLs over a network. At first,
greputils.net does not have many functions or classes. So this
submodule is implemented as a single file, net.py, in the
greputils folder.
But over time it grows, until you convert it into a directory of its own:
greputils/
greputils/__init__.py
greputils/files.py
greputils/contain.py
greputils/net/__init__.py
greputils/net/html.py
greputils/net/text.py
greputils/net/json.py
At this point, you have a choice to make about the module
interface—which will be somewhat dictated by how other code is using
the greputils.net components already. In the first choice,
you will import components into greputils/net/__init__.py, like this:
# in greputils/net/__init__.py
from .html import (
    grep_html,
    grep_html_as_text,
)
from .json import (
    grep_json,
    grep_json_many,
)
This makes each of these functions importable from the greputils.net
submodule. So in the top-level __init__.py file, we can simply do a
relative import:
# in greputils/__init__.py
from .net import (
    grep_html,
    grep_html_as_text,
    grep_json,
    grep_json_many,
)
With this organization, you can import the grep_html() function from
greputils directly, but also from the greputils.net submodule:
# This...
from greputils import grep_html
# Or this.
from greputils.net import grep_html
Both of these will successfully import the function. Whether you want people to do both is a different question.
In this case, we started with a file named greputils/net.py, which
means everything was originally importable from greputils.net. For
this reason, there may be existing code which does that, in any
program which uses greputils. By arranging these two __init__.py
files in this way, we allow that code to continue working unmodified,
while also allowing grep_html to be imported from greputils
directly.
We have another choice. We could instead import from the most nested
submodules, directly into the top-level __init__.py file, like this:
# in greputils/__init__.py
from .net.html import (
    grep_html,
    grep_html_as_text,
)
from .net.json import (
    grep_json,
    grep_json_many,
)
# Plus the other relative imports from other submodules.
See how this is different? Rather than importing grep_html() from
.net, it is instead imported from .net.html—and similar for the
others.
So if you do it this way, what goes in net/__init__.py? Nothing. In
this case, you can make that file empty. The consequence is that from
greputils.net import grep_html will not work; you can only import it
from greputils. But if that would not break any existing code, then
you may decide there is no reason to support anything but the
top-level import. This is certainly an option if you are creating the
full module with a nested file structure from the start.
You might wonder: if net/__init__.py is empty, is it necessary? No,
it is not. Modern Python allows you to omit an empty __init__.py
file entirely. So the file list will look like this:
greputils/
greputils/__init__.py
greputils/files.py
greputils/contain.py
greputils/net/html.py
greputils/net/text.py
greputils/net/json.py
See how there is only one __init__.py file, at the top-level module directory.
Notice that throughout this process, from the very start, the
interface to greputils did not change. Back when we had a single
greputils.py file, you could write from greputils import
grepfile. And you can write the same thing now, when greputils is a
directory instead of a file.
In other words, whether you implement your module as a single file or as a directory is just an implementation detail. The developer using your code doesn’t have to know or care how you made the module. When you make the change, people using the module probably won’t even notice.
This evolution is quite common. It happens just as I have described in realistic software development cycles.
There is an antipattern I have to tell you about. It looks like this:
from some_module import *
You are using a “from...import” statement, but the final field is
not a sequence of components to import. Instead, it is the literal “*” character. For greputils, it would look like this:
from greputils import *
What this does is import every single object from the module. Every function, every class, every global variable. It drops them all into your current namespace. Like dumping a bucket of smelly fish all over the floor.
This is a problem for several reasons. But the worst reason is that it
creates a time bomb. Suppose your application uses greputils, and
also another module called filesearch. And imagine you write these
two import lines:
from greputils import *
from filesearch import *
You know that greputils has a function called grepfile(). And
let’s say the filesearch module does not contain anything with this
name, so there is no conflict. You test your code, it works great,
and you deploy it. Everything is great.
Now imagine several months pass, and filesearch has a critical
security update. So without you even knowing, someone in your
organization—on the DevOps team, or another developer—decides
to upgrade filesearch. But this update also introduces a new
function, called grepfile()—which is completely unrelated to the
grepfile() function in greputils, even though it has the same
name.
What does Python do? Because the import from filesearch comes
second, Python will silently override the greputils version. In
other words, when you call grepfile() in your program, you are
suddenly calling filesearch.grepfile(), not greputils.grepfile().
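You can reproduce the shadowing mechanics in miniature, with no modules at all; the two definitions below stand in for the two same-named functions arriving from the two import-star statements:

```python
# The first definition plays the role of greputils.grepfile():
def grepfile():
    return "from greputils"

# The second plays filesearch.grepfile(), arriving later. Rebinding
# the same name silently wins, exactly like the second "import *".
def grepfile():
    return "from filesearch"

print(grepfile())  # from filesearch
```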
What happens next? If you are lucky, this will cause an immediate and obvious error. Unit tests will fail, or a manual test will catch it before you deploy.
If you are not lucky, you have a time bomb.
In this case, the code path using grepfile() is not immediately
triggered. It gets deployed to production, and everything appears to
work correctly. And the application may continue to run fine just long
enough for everyone to forget about the library upgrade.
Until, when you least expect it, that code path containing
grepfile() finally runs. Boom.
You may not be this unlucky. But if you are, this is just about the worst kind of bug. It is unpredictable and disruptive. And it can be hard to fix simply because it is far enough removed in time that the probable cause is no longer fresh in anyone’s mind.
You can get other problems from importing star, but in my opinion this is the worst one. The only protection is to not do it at all.
So what do you do instead? After all, when someone imports star, they’re not trying to ruin your life. Normally they do it because the module has many functions and classes they need to use.
There are several strategies. If you only need to import a few components, just import each by name. Just like we have been doing:
from greputils import grepfile

# And then later in your code:
grepfile("pattern to match", "/path/to/file.txt")
This has the advantage of being completely precise and specific. But it is inconvenient when you are importing more than a few items.
An alternative: simply import the module itself. This is just like importing star, except everything is namespaced inside the module. So you get none of the name conflict problems. Like this:
import greputils

# And then later in your code:
greputils.grepfile("pattern to match", "/path/to/file.txt")
The downside is that you have to type the module name
over and over. To alleviate this, Python lets you abbreviate the
module name when you import it, with an "as" clause. For example, we
can rename greputils to gu inside your code:
import greputils as gu

# And then later in your code:
gu.grepfile("pattern to match", "/path/to/file.txt")
It is like creating a more convenient alias for the module, for use just inside your program. This is commonly used in different Python library ecosystems:
# Data scientists are smart enough to use this a lot.
import numpy as np
import pandas as pd
Then in your code, you can refer to np.array, pd.DataFrame, and so on.
This renaming trick works not just with modules, but also with items imported from a module. So if you are calling a function with a long name over and over, you can give it a shorter or better name:
# Import a function and rename it:
from greputils import igrepfile as cigrep

# Similar to:
from greputils import igrepfile
cigrep = igrepfile
This is not just useful for making function names shorter. It also can
improve clarity. For example, the tremendously useful dateutil
library provides a function that can automagically parse just about
any date-time string you give it, returning a nice datetime.datetime instance:
>>> from dateutil.parser import parse
>>> parse('Sat Aug 10 08:03:50 2074')
datetime.datetime(2074, 8, 10, 8, 3, 50)
>>> parse('01-05-2081 11:39pm')
datetime.datetime(2081, 1, 5, 23, 39)
>>> parse('5/15/57 22:29')
datetime.datetime(2057, 5, 15, 22, 29)
That is extremely useful. But parse() is almost the most generic
name you can give to a function. It just does not give enough of a
clue what the function actually does. You might as well call it
compute().
So what I do is rename the function when I import it:
from dateutil.parser import parse as parse_datetime
Then my code can invoke it with the much more informative name of
parse_datetime():
>>> parse_datetime('Sat Aug 10 08:03:50 2074')
datetime.datetime(2074, 8, 10, 8, 3, 50)
>>> parse_datetime('01-05-2081 11:39pm')
datetime.datetime(2081, 1, 5, 23, 39)
>>> parse_datetime('5/15/57 22:29')
datetime.datetime(2057, 5, 15, 22, 29)
While renaming on import can be useful, it is also possible to overdo it, to the point the code becomes harder to read. As a general rule, I recommend you use this feature to improve readability, and otherwise simply use the original name.
Normally, the files of your module will contain definitions of classes and functions, and perhaps assign top-level variables. They might also import from sub-modules which follow the same pattern.
Modules constructed this way have no side effects when you import them. But that does not mean no code is executed. In fact, executing code is exactly what happens every time you import.
Few people understand this important point. It relates to what we
discussed near the start of this chapter, when we learned about the
concept of a “main guard”. Imagine you have a module with this code
(using ... as a placeholder for code I am omitting here):
CSV_FIELDS = ['date', 'revenue', ...]

class MissingContent(Exception):
    ...

def extract_params(text):
    ...

class UserData:
    def __init__(
        self,
        first_name,
        last_name,
    ):
        ...
This module provides several components:
A list of strings, called CSV_FIELDS
An exception called MissingContent
A function called extract_params()
A class called UserData
All of these are objects. Yes, even the classes and the function, because in Python everything is an object. How were these objects created? They were created because Python executes the lines of code which define them.
To put it another way, consider these lines:
class UserData:
    def __init__(
        self,
        first_name,
        last_name,
    ):
        ...
This is a class statement. Python will execute this class
statement. Executing this statement will create a new
object, called UserData, which is a class. This will be inserted
into the current scope, so lines of code after this class
statement can create instances of the class. Lines of code prior to
this statement cannot, because the class statement which created the
class object has not been executed yet.
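To see this ordering in action, here is a minimal sketch. (This UserData is a toy stand-in with an assumed constructor, not the module from above.)

```python
# Names only exist after the statement defining them has executed.

try:
    UserData("Ada", "Lovelace")  # too early: the class statement below
except NameError as exc:         # has not run yet
    print(exc)                   # name 'UserData' is not defined

class UserData:
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name

# After the class statement executes, instances can be created.
user = UserData("Ada", "Lovelace")
print(user.first_name)
```

The first attempt raises NameError, because Python has not yet executed the class statement that creates the UserData object.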
Everything said above applies to function definitions, too:
def extract_params(text):
    ...
This def statement is Python code which is executed. The effect of
that execution is to create a function object, named
extract_params(). After that statement is executed, other Python
code can invoke that function.
It is important to understand there is no special difference between
class and function definitions and “normal” code. In order to
create the class or function object, Python must execute the class
or def statement. At this level, it is exactly the same as a
statement like x = 1, as far as Python is concerned.
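One consequence: since def is just an executable statement, you can run it conditionally, exactly as you would a variable assignment. A small sketch:

```python
import sys

# Define a function differently depending on the platform.
# Only one of the two def statements actually executes.
if sys.platform == "win32":
    def path_separator():
        return "\\"
else:
    def path_separator():
        return "/"

print(path_separator())
```

Whichever def statement runs is the one that creates the function object bound to the name path_separator.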
So when Python imports a module, it must execute that module’s code
first. But this is “all or nothing”. When you import a class, Python
does not scan through the code to find just that class statement,
and only execute that. It executes the whole module.2 That is why we need to use main
guards (i.e., the if __name__ == "__main__" trick).
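As a refresher, a main guard looks like the sketch below (the function body is a toy implementation for illustration). The guarded code runs only when the file is executed as a script, not when the module is imported:

```python
# mymodule.py (hypothetical)

def extract_params(text):
    # Toy implementation, just for illustration.
    return text.split()

if __name__ == "__main__":
    # This branch runs only when the file is executed directly,
    # e.g. "python mymodule.py" - never on "import mymodule".
    print(extract_params("alpha beta"))
```

Because importing executes the whole module, anything outside the guard runs on import; the guard is what keeps script-only behavior out of the import path.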
Some modules exploit this to do import-time code execution on purpose. This is typically used for some kind of initialization or side effect that will take place when that specific module is first imported. For example, imagine a module which creates a database connection object at the top-level:
# database.py

# Function to create a new network connection to the DB
def initiate_database_connection():
    ...

# Go ahead and create a connection handle.
# This will be a top-level object,
# importable from the module.
conn = initiate_database_connection()
This is sometimes useful or necessary. But I suggest you avoid import-time side effects, unless you have a compelling reason.
For one, these import-time side effects are often surprising to people who reuse your code. That can include you, months down the road, after you have written the module and completely forgotten that importing it will trigger some kind of microservice connection, for example. Another way of saying this: import-time side effects violate the Principle of Least Surprise.
Another problem: you cannot control the exact timing. Python will execute that module code at some point, but exactly when is not defined by the language. It is hard to predict in what order Python will execute individual module files, and the order can change without warning as you evolve the code, or when you upgrade to a new version of Python. Imagine a program whose lines could change order each time you run it; import-time side effects are a bit like that.
Finally, they are inflexible. You cannot avoid executing that code if you do not want to; your only option is to not use the module at all. You generally cannot customize its behavior, because the import itself triggers the execution with hard-coded arguments.
The database example above is guilty of all these crimes. It’s a good example of what not to do.
That said, creating import-time side effects is sometimes necessary—or at least useful enough that it is worth the downsides. If you encounter that situation, do not worry about it; go ahead and do it. But always ask if you can organize your application to avoid it.
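One common way to organize around it is lazy initialization: defer the side effect into a function that creates the resource on first use. Here is a sketch of the database module reworked along those lines (the names and the dict standing in for a real connection are hypothetical):

```python
# database.py, reworked to avoid import-time side effects.
# Importing this module executes only def statements and one
# assignment - no connection is made until a caller asks for one.

_conn = None

def initiate_database_connection(timeout=30):
    # Hypothetical stand-in for real connection logic.
    return {"connected": True, "timeout": timeout}

def get_connection(timeout=30):
    """Create the connection lazily, the first time it is needed."""
    global _conn
    if _conn is None:
        _conn = initiate_database_connection(timeout)
    return _conn
```

Callers now invoke get_connection() when they actually need the handle. This restores control over timing, lets callers customize arguments, and makes the import itself side-effect free.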
Python’s module system is so nicely designed that you can go far with it just by understanding a few basics. And having read this chapter, you know there is a lot more depth. That simple interface over complex and powerful semantics makes it possible to amplify the impact of the code you write. You create functions and classes which solve hard problems that people care about, package them in a nice module interface, and suddenly you have empowered a lot of folks to do more with less effort. It is a form of leverage. And whether you are distributing an open source project, creating a module for your team, or even just packaging your code to make it easier for you to reuse in the future, this investment of your effort always comes back to you in positive ways.
1 A component is a general term for some modular unit of software. Simple components can be a function; a class; or an instance of a class. Components can also be larger-scale structures, such as whole libraries or services. But in this chapter, “component” just means “something you can import from a Python module”.
2 Could Python be designed to execute just the relevant class statement? Not really. Python is so dynamic a language that a class defined early in the file can be renamed by a line near the end of that file. The only way to ensure the correct class is imported is to execute the whole module.