Chapter 8. Module Organization

For anything more than a small program, you will want to organize your code into modules. This is also the unit of organization for reusable libraries—both libraries you create, and outside code you import into your own codebase.

Python’s module system is a delight: easy to use, well designed, and extremely flexible. Making full use of it requires understanding its import mechanisms, and how they interact with namespaces. We will dive deep into all of that in this chapter.

In particular, we will focus on how modules evolve. Requirements change over time; new requirements come in, you gain clarity on existing ones, and you learn better ways to organize your codebase. So we will cover best practices for refactoring and updating module structure during the development process.

This is an important and practical part of working with modules. But for some reason, it is rarely talked about or written about. Until now.

Spawning a Module

To touch on everything important about modules, we will follow the lifecycle of a small Python script that gradually grows in scope and lines of code—eventually growing to a point where we want to package its components​1 into reusable modules. We do this not just for sensible organization, but also so we can import them into other applications. This evolution from script to spin-off library happens all the time in real software development.

Imagine you create a little Python script, called findpattern.py. Its job is simple: to scan through a text file, and print out those lines which contain a certain substring.

This is similar to a program called grep in the Linux (and Unix) world, which operates on a stream of lines of text, filtering for those which match a pattern you specify. findpattern.py is less powerful than grep. But it is more portable, and—more importantly—lets us illustrate everything we need in this chapter.

# findpattern.py
import sys

def grepfile(pattern, path):
    with open(path) as handle:
        for line in handle:
            if pattern in line:
                yield line.rstrip('\n')

pattern, path = sys.argv[1], sys.argv[2]
for line in grepfile(pattern, path):
    print(line)

This program defines a generator function called grepfile(), which may remind you of the matching_lines_from_file() function in Chapter 1—because it demonstrates many of the same Python best practices. It takes two arguments: pattern, the substring pattern to search for, and path, the path to the file to search in.

Imagine you have a file named log.txt, containing simple log messages, one per line, like this:

ERROR: Out of milk
WARNING: Running low on orange juice
INFO: Temperature adjusted to 39 degrees
ERROR: Alien spacecraft crashed

Your program needs to get these two values from the command line. So you invoke the program like this:

% python findpattern.py ERROR log.txt

It is best to extract the command-line arguments with a specialist parsing library, such as Python’s built-in argparse module. But to avoid a side quest explaining how that works, we will simply extract them from argv in the sys module.

sys.argv is a list of strings. Its first element, at index 0, is the name of the program you are running; so its value will be "findpattern.py" in this case. What we want are the command-line arguments, ERROR and log.txt. These will be at indices 1 and 2 in sys.argv, respectively. So we load those into the pattern and path variables. With that out of the way, we can simply iterate through the generator object returned by grepfile(), printing out each matching line.
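To make the indexing concrete, here is a small sketch simulating the argv list for the invocation above:

```python
# Simulating sys.argv for: python findpattern.py ERROR log.txt
argv = ["findpattern.py", "ERROR", "log.txt"]

# Index 0 is the program name; 1 and 2 are the arguments.
pattern, path = argv[1], argv[2]
print(pattern)  # prints: ERROR
print(path)     # prints: log.txt
```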

Whenever you create a Python file, that also creates a module of the same name (minus the .py extension). So our findpattern.py file, in addition to being a complete program, also creates a module named findpattern. This happens automatically; you do not have to declare this, or take any extra steps.

Whenever you have a module, that means you can import from it. In particular, we have a nice function named grepfile(), which is general enough that we may want to reuse it in other programs. Let’s do that.

You decide to create a program called finderrors.py to show only the ERROR lines—similar to findpattern.py, but more specialized.

grepfile() is a good tool for implementing this, so you decide to import it. Your first version looks like this:

# finderrors.py
import sys
from findpattern import grepfile

path = sys.argv[1]
for line in grepfile('ERROR:', path):
    print(line)

This looks straightforward. But when you run the program, you get a strange error:

$ python finderrors.py log.txt
Traceback (most recent call last):
  File "finderrors.py", line 3, in <module>
    from findpattern import grepfile
  File "findpattern.py", line 10, in <module>
    pattern, path = sys.argv[1], sys.argv[2]
IndexError: list index out of range

IndexError? What on earth would cause that?

Let’s step through the stack trace. You can see that the error originates in line 3 of finderrors.py. That is interesting, because it is the import line—where grepfile() is imported from the findpattern module.

Next, the stack trace descends into the findpattern.py file. Specifically, on line 10, where it unpacks variables from sys.argv.

Now you understand the problem. findpattern.py was written to be used as a standalone program. But we are creating a new program and want to reuse its code. In the process of importing, the line that unpacks sys.argv runs, expecting two command-line arguments, while finderrors.py actually takes only one.

When you import from a module, Python creates that module by executing its code. It executes the def statements to create the functions, and executes the class statements to create the classes. But in the process of creating the module, Python executes the entire file—including that sys.argv line.
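You can verify this with a small experiment. This sketch writes a hypothetical module, demo_mod.py, into a temporary directory, then imports it; the top-level print() runs at import time:

```python
import pathlib
import sys
import tempfile

# Write a throwaway module whose top level has a visible side effect.
tmpdir = tempfile.mkdtemp()
pathlib.Path(tmpdir, "demo_mod.py").write_text(
    "print('top-level code runs at import time')\n"
    "def helper():\n"
    "    return 42\n"
)

sys.path.insert(0, tmpdir)
import demo_mod           # prints: top-level code runs at import time
print(demo_mod.helper())  # prints: 42
```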

So what do we do? The code is correct for running findpattern.py. But it doesn’t allow finderrors.py to run. How do we import from findpattern and allow both programs to run?

The solution is to use a main guard. It looks like this:

# Replace the final 3 lines of findpattern.py with this:
if __name__ == "__main__":
    pattern, path = sys.argv[1], sys.argv[2]
    for line in grepfile(pattern, path):
        print(line)

We have taken the last three lines and indented them inside an “if” block. The “if” condition references a magic variable called __name__. This variable is automatically defined in every Python file, and always has a string value. But to understand the nature of that value, we must first understand something about how Python programs are executed.

When you execute a Python program, often the code which makes up the program is spread across more than one file. That is what happens here, right? When you run finderrors.py, some of this program’s code is in that file. But some of the code is in the findpattern.py file. So the Python code for this program is distributed over two different files.

But here’s the important point: whenever you run a Python program consisting of several files, one of those will be the main executable for the program. That is the file which will show in the process table or task manager of the operating system. When you run the finderrors.py program, then finderrors.py is the main executable. That is what will show up in the process table, even though it is also utilizing code in the findpattern.py file. Of course, real Python programs often use dozens, hundreds, or even thousands of distinct Python files, especially when your program uses many libraries.

And so, back to __name__. This magic variable will have one of two values:

  • If it is referenced inside the main executable file, its value will be the string “__main__”.

  • If it is referenced in any other Python file, it will be the name of the module that file creates.
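A small experiment makes this concrete. The sketch below writes a hypothetical one-line module, show_name.py, which records the value of __name__ it sees when imported:

```python
import pathlib
import sys
import tempfile

# The module's only job is to capture __name__ at import time.
tmpdir = tempfile.mkdtemp()
pathlib.Path(tmpdir, "show_name.py").write_text("seen_name = __name__\n")

sys.path.insert(0, tmpdir)
import show_name

# Inside an imported file, __name__ is the module's name.
print(show_name.seen_name)  # prints: show_name
```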

And so we use this in findpattern.py, to make this file usable as a stand-alone program, and as a module which can be imported from. By checking the value of __name__, we effectively partition the file into two parts: code which is always executed (and thus its objects are importable), and code which is executed only when this file is itself run as a program.

Here is the full source of our amended findpattern.py:

import sys

def grepfile(pattern, path):
    with open(path) as handle:
        for line in handle:
            if pattern in line:
                yield line.rstrip('\n')

if __name__ == "__main__":
    pattern, path = sys.argv[1], sys.argv[2]
    for line in grepfile(pattern, path):
        print(line)

Let’s step through this. Suppose you execute findpattern.py as the program. Then the findpattern.py file is the main executable, so the value of __name__ will be equal to “__main__”. This means the “if” condition evaluates to True, and the final three lines are executed.

In contrast, imagine you run finderrors.py. This file imports from the findpattern module, which means all lines in the findpattern.py file are executed. But the value of __name__ is set to the string “findpattern”. So the “if” condition evaluates to False, and the final three lines are skipped. Now both programs work:

$ python findpattern.py ERROR log.txt
ERROR: Out of milk
ERROR: Alien spacecraft crashed
$ python finderrors.py log.txt
ERROR: Out of milk
ERROR: Alien spacecraft crashed

Creating Separate Libraries

Time passes, and new requirements roll in—as they always do. You now need a variant of the grepfile() function, which is case insensitive. Let’s call it igrepfile().
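A minimal sketch of igrepfile(), assuming it mirrors grepfile() and compares case-insensitively by lowercasing both the pattern and each line:

```python
def igrepfile(pattern, path):
    # Case-insensitive variant of grepfile(): lowercase the pattern
    # once, and lowercase each line before comparing.
    pattern = pattern.lower()
    with open(path) as handle:
        for line in handle:
            if pattern in line.lower():
                yield line.rstrip('\n')
```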

In your team, suppose you are in charge of finderrors.py, and your coworker is in charge of findpattern.py. You are importing the grepfile() function from findpattern.py. But where are you going to put the new igrepfile() function?

You cannot put it in findpattern.py. Your coworker is politely allowing you to reuse his code by importing from the findpattern module, but he draws the line at adding new functions he does not care about to his file. That is just creating more maintenance work for him.

At this point, it makes more sense to organize the code into a different module. Up to now, findpattern.py has been filling two completely different roles. It’s a program that does something useful, and it is a container of sorts: holding components that can be reused by other programs—the grepfile() function, in particular.

Now you are going to separate these roles. You create a new file, called greputils.py. It is not meant to be an executable program. Its sole purpose is to create a module, called greputils, holding reusable code.

In particular, it will define the grepfile() function. You cut the entire definition of grepfile() from findpattern.py, and paste it into greputils.py. Then in both findpattern.py and finderrors.py, you add this import statement:

from greputils import grepfile

Congratulations: you have created a reusable library for your team’s software.

This is a great step forward in code organization. You can add whatever new classes and functions make sense to this greputils module, without disturbing any other program which relies on it. A clear separation of roles.

Notice something else: this has evolved naturally. The entire process described above is fully realistic, in terms of how a module of reusable code “emerges” during normal application development. This example shows how it happens in a team, but it happens much the same way when you are developing solo. As you develop new programs, you naturally find you want to reuse functions and classes from your previous projects, and it only makes sense to collect them all in a single convenient module.

The current versions of our files:

# greputils.py
def grepfile(pattern, path):
    with open(path) as handle:
        for line in handle:
            if pattern in line:
                yield line.rstrip('\n')

# findpattern.py
import sys

from greputils import grepfile

pattern, path = sys.argv[1], sys.argv[2]
for line in grepfile(pattern, path):
    print(line)

# finderrors.py
import sys

from greputils import grepfile

path = sys.argv[1]
for line in grepfile('ERROR:', path):
    print(line)

Continuing to the next phase of our module’s evolution, imagine you keep adding new functions and classes over time. So the greputils.py file gets bigger and bigger, with more and more lines of code.

To make it concrete: imagine you often have to check whether a text file contains a certain substring. You just want a yes or no answer to that question; you don’t need the set of lines that match, you just want to know whether at least one line matches, or none of them do.

This function should return True if the text file contains that string, and False otherwise. You call this function contains():

def contains(pattern, path):
    with open(path) as handle:
        for line in handle:
            if pattern in line:
                return True
    return False

You also create a case-insensitive version named icontains(), and so on; you continue creating new functions and classes over time as you and your teammates need them, always adding them to greputils.py.
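For reference, a sketch of icontains(), assuming it follows the same pattern as contains() with a lowercased comparison:

```python
def icontains(pattern, path):
    # Case-insensitive version of contains(): True if any line
    # contains the pattern, ignoring case.
    pattern = pattern.lower()
    with open(path) as handle:
        for line in handle:
            if pattern in line.lower():
                return True
    return False
```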

Eventually, this file will just get too big. It is going to be awkward to work with at best, and encourage excessive coupling and code entanglement at worst. If multiple developers are hacking on the same codebase, it may even increase the frequency of merge conflicts. And code logically separated into different files is simply easier to work with, for example as separate tabs in your editor.

So we refactor the module. Make this distinction: the module is not the file greputils.py. Rather, the module is greputils, which happens to be implemented as a file named greputils.py. But there are other ways to implement the same module, with the same contents and the exact same interface for importing.

Specifically, we can implement the module as a collection of several files, organized in a particular way. How do you do that?

Multifile Modules

The first step is to create a directory, with the name of the module: greputils. And this directory will contain several files.

The first is a file named __init__.py. The name is just like the __init__ method of Python classes, except with .py at the end. This is where we create the interface to the module, in terms of what you can import from it directly.

But in order to do that, we first create another file in the greputils folder, which we will call files.py. This is where we will put the file-grepping functions: grepfile(), igrepfile(), and any others we have created so far. We will also create a third file, named contain.py, which is where we put the “does this file contain this string” functions, like contains() and icontains(). Now, our filesystem layout looks like this:

greputils/
greputils/__init__.py
greputils/files.py
greputils/contain.py

This solves our giant file problem. We can put as many files in this greputils folder as we want. No matter how many classes and functions we invent, we can split them up into reasonably sized files, each defining a submodule within the main greputils module.

But we have one more step, which returns us to the __init__.py file. Remember I said this file creates the import interface of the module. Originally, when everything lived in a single greputils.py file, you could write code like:

# used by findpattern.py
from greputils import grepfile

# used by a different program
from greputils import icontains

When you refactor a function, you want its interface and outward behavior to be unchanged. Your modified implementation should be invisible to anyone using that function. A close analogy holds for modules. When we “refactor” it to be a directory rather than a single file, we want these import statements to continue to work just as they did before.

When you implement a module as a single file, people can automatically import anything in the module with a simple from MODULENAME import ... statement. But when you implement a module as a directory, that is not automatic. You must explicitly declare what components will be directly importable in this way.

You do that in the __init__.py file. To see how it works, we must understand a new concept, called submodules.

When you implement a module as a directory, the files in that directory create submodules. These are accessed in a hierarchy under the top-level module, using a dotted-name syntax. For example, if grepfile() is in the greputils/files.py file, then you can import it with a statement like:

from greputils.files import grepfile

But we do not want to break all our existing import statements, even if we did not mind the extra typing. The solution relies on the fact that anything inside greputils/__init__.py will be directly importable from the module itself. So all we have to do is import the submodule contents into the __init__.py file. One way is to do it like this:

# inside greputils/__init__.py
# (Spoiler: Don't do it like this, there is a better way).
from greputils.files import grepfile

That will actually work. But it is not as maintainable as it could be. If you rename the module in the future, this absolute import is one more place you must remember to update; miss it, and you have a bug. Reorganizing the module in some other way creates the same problem.

Instead, do a relative import. Inside the __init__.py file, write this:

# inside greputils/__init__.py
from .files import grepfile

So you are importing from .files, not greputils.files. That leading “.” is important: it is what makes this a relative import, and the syntax only works inside a module that is implemented as a directory. When you use it inside the __init__.py file, it imports objects from submodules relative to that directory, making them accessible inside the __init__.py file.

We do all this because anything in the __init__.py file is directly importable from the module itself. In other words, because of these relative imports, this statement works again:

from greputils import grepfile

We do this for all components for all the submodules. Our final __init__.py file, including everything we have created so far, looks like this:

# greputils/__init__.py
from .files import grepfile, igrepfile
from .contain import contains, icontains

Now we can import all these functions—grepfile(), igrepfile(), contains(), and icontains()—directly:

# In one program...
from greputils import contains, grepfile

# And in another:
from greputils import icontains, igrepfile

You can be selective in what you import into the __init__.py file. Your submodules may contain internal helper functions and classes which you do not necessarily want other people importing, using, and crafting their code to depend on. This is likely to be common, in fact, as your modules grow beyond a certain size.

Handling this is easy: simply do not import them into the __init__.py file. Then they are not directly importable from the module. Someone can still import them from the dotted path of the submodule if they peek into the source to find they are there, but few people will do that. And if you change one of those internal components later in a way that breaks their code, that is arguably their problem and not yours.
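For instance, imagine (hypothetically) that files.py grows a line-reading helper shared by the grep functions. A leading underscore signals it is internal, and you simply leave it out of __init__.py:

```python
# greputils/files.py (hypothetical internal helper)

def _stripped_lines(path):
    # Internal helper: yields lines without trailing newlines.
    # Not imported into __init__.py, so not part of the interface.
    with open(path) as handle:
        for line in handle:
            yield line.rstrip('\n')

def grepfile(pattern, path):
    for line in _stripped_lines(path):
        if pattern in line:
            yield line

# greputils/__init__.py would then import only the public name:
# from .files import grepfile
```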

Import Syntax and Version Control

When importing multiple components, you can do it all on one line, with a single import statement. The greputils/__init__.py does this, importing from its submodules:

# greputils/__init__.py
from .files import grepfile, igrepfile
from .contain import contains, icontains

If you are importing a couple of items, that works fine. But suppose you are importing more. Putting all those on one line creates problems, especially if you are working in a team.

Imagine you and I are developing the same codebase, in different feature branches. In your branch, you add a new function named grepfileregex(). You put this in greputils/files.py, and modify greputils/__init__.py like this:

from .files import grepfile, igrepfile, grepfileregex

In my branch, I rename igrepfile() to grepfilei(), and edit the __init__.py file like this:

from .files import grepfile, grepfilei

When we each finish our work and merge into main, suddenly we have a race condition. Whoever merges first will have no problem. But whoever merges last will get a merge conflict. Our current version-control tools do not know how to resolve this automatically; the loser of the race has to clean it up manually. That is time-consuming at best, and risks creating new bugs at worst.

There is an easy solution. Python allows you to split imported components over several lines, using this syntax:

from .files import (
    grepfile,
    igrepfile,
)
from .contain import (
    contains,
    icontains,
)

Note the parentheses. This is much better, because our version control software can resolve the merge automatically, without the risk of introducing new bugs. As a general rule, if you are importing more than one item, your life will be happier if you split the import over several lines in this way.

Notice another detail here. Just looking at the first relative import:

from .files import (
    grepfile,
    igrepfile,
)

Do you see how there is a comma after “igrepfile”, even though it is the last item in the sequence? Python does not require you to put a comma there. But I recommend you do, because it pinpoints the diffs even further.

Imagine you do not put that comma there, then add your new grepfileregex() function. So you are changing it from this:

# no ending comma
from .files import (
    grepfile,
    igrepfile
)

…​to this:

from .files import (
    grepfile,
    igrepfile,
    grepfileregex
)

You are adding a line for grepfileregex() but also modifying the previous line—adding a comma after igrepfile. So the diff will delete the old line without the comma, and add two lines. Like this:

-    igrepfile
+    igrepfile,
+    grepfileregex

In contrast, if you append a comma after every imported item, you are changing it from this:

# WITH an ending comma
from .files import (
    grepfile,
    igrepfile,
)

…to this:

from .files import (
    grepfile,
    igrepfile,
    grepfileregex,
)

This means your diff goes from three lines down to just one:

+    grepfileregex,

Like I said: more pinpointed diffs. Only good can come from that. Not only will you have fewer merge conflicts; if your team is doing code reviews, the reviewer will have to think less to decipher what you are changing, and is less likely to miss a bug that would otherwise be easy to catch.

Nested Submodule Structure

This directory structure for modules is recursive. A module can be implemented as a file, or as a directory; and this also applies to submodules, sub-submodules, and so on.

Imagine you add a new submodule, called greputils.net, collecting functions that grep the contents of URLs over a network. At first, greputils.net does not have many functions or classes, so this submodule is implemented as a single file, net.py, in the greputils folder.

But over time it grows, until you refactor it into a directory of its own:

greputils/
greputils/__init__.py
greputils/files.py
greputils/contain.py
greputils/net/__init__.py
greputils/net/html.py
greputils/net/text.py
greputils/net/json.py

At this point, you have a choice to make about the module interface—which will be somewhat dictated by how other code is using the greputils.net components already. In the first choice, you will import components into greputils/net/__init__.py, like this:

# in greputils/net/__init__.py
from .html import (
    grep_html,
    grep_html_as_text,
)
from .json import (
    grep_json,
    grep_json_many,
)

This makes each of these functions importable from the greputils.net submodule. So in the top-level __init__.py file, we can simply do a relative import:

# in greputils/__init__.py
from .net import (
    grep_html,
    grep_html_as_text,
    grep_json,
    grep_json_many,
)

With this organization, you can import the grep_html() function from greputils directly, but also from the greputils.net submodule:

# This...
from greputils import grep_html
# Or this.
from greputils.net import grep_html

Both of these will successfully import the function. Whether you want people to do both is a different question.

In this case, we started with a file named greputils/net.py, which means everything was originally importable from greputils.net. For this reason, there may be existing code which does that, in any program which uses greputils. By arranging these two __init__.py files in this way, we allow that code to continue working unmodified, while also allowing grep_html to be imported from greputils directly.

We have another choice. We could instead import from the most nested submodules, directly into the top-level __init__.py file, like this:

# in greputils/__init__.py
from .net.html import (
    grep_html,
    grep_html_as_text,
)
from .net.json import (
    grep_json,
    grep_json_many,
)

# Plus the other relative imports from other submodules.

See how this is different? Rather than importing grep_html() from .net, it is instead imported from .net.html—and similar for the others.

So if you do it this way, what goes in net/__init__.py? Nothing. In this case, you can make that file empty. The consequence is that from greputils.net import grep_html will not work; you can only import it from greputils. But if that would not break any existing code, then you may decide there is no reason to support anything but the top-level import. This is certainly an option if you are creating the full module with a nested file structure from the start.

You might wonder: if net/__init__.py is empty, is it necessary? No, it is not. Modern Python allows you to omit an empty __init__.py file entirely. So the file list will look like this:

greputils/
greputils/__init__.py
greputils/files.py
greputils/contain.py
greputils/net/html.py
greputils/net/text.py
greputils/net/json.py

See how there is only one __init__.py file, at the top-level module directory.

Notice that throughout this process, from the very start, the interface to greputils did not change. Back when we had a single greputils.py file, you could write from greputils import grepfile. And you can write the same thing now, when greputils is a directory instead of a file.

In other words, whether you implement your module as a single file or as a directory is just an implementation detail. The developer using your code doesn’t have to know or care how you made the module. When you make the change, people using the module probably won’t even notice.

This evolution is quite common. It happens just as I have described in realistic software development cycles.

Antipattern Warning

There is an antipattern I have to tell you about. It looks like this:

from some_module import *

You are using a “from...import” statement, but the final field is not a sequence of components to import. Instead, it is the literal “*” character. For greputils, it would look like this:

from greputils import *

What this does is import every single object from the module. Every function, every class, every global variable. It drops them all into your current namespace. Like dumping a bucket of smelly fish all over the floor.

This is a problem for several reasons. But the worst reason is that it creates a time bomb. Suppose your application uses greputils, and also another module called filesearch. And imagine you write these two import lines:

from greputils import *
from filesearch import *

You know that greputils has a function called grepfile(). And let’s say the filesearch module does not contain anything with this name, so there is no conflict. You test your code, it works great, and you deploy it. Everything is great.

Now imagine several months pass, and filesearch has a critical security update. So without you even knowing, someone in your organization—on the DevOps team, or another developer—decides to upgrade filesearch. But this update also introduces a new function, called grepfile()—which is completely unrelated to the grepfile() function in greputils, even though it has the same name.

What does Python do? Because the import from filesearch comes second, Python will silently override the greputils version. In other words, when you call grepfile() in your program, you are suddenly calling filesearch.grepfile(), not greputils.grepfile().
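You can simulate the effect with plain dicts, since a star import essentially copies the module’s names into your namespace, with later copies overwriting earlier ones. (The module contents here are hypothetical stand-ins.)

```python
# Stand-ins for the two modules' namespaces.
greputils_ns = {"grepfile": lambda pattern, path: "greputils version"}
filesearch_ns = {"grepfile": lambda pattern, path: "filesearch version"}

my_namespace = {}
my_namespace.update(greputils_ns)   # like: from greputils import *
my_namespace.update(filesearch_ns)  # like: from filesearch import *

# The later import silently wins the name collision:
print(my_namespace["grepfile"]("ERROR", "log.txt"))  # prints: filesearch version
```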

What happens next? If you are lucky, this will cause an immediate and obvious error. Unit tests will fail, or a manual test will catch it before you deploy.

If you are not lucky, you have a time bomb.

In this case, the code path using grepfile() is not immediately triggered. It gets deployed to production, and everything appears to work correctly. And the application may continue to run fine just long enough for everyone to forget about the library upgrade.

Until, when you least expect it, that code path containing grepfile() finally runs. Boom.

You may not be this unlucky. But if you are, this is just about the worst kind of bug. It is unpredictable and disruptive. And it can be hard to fix simply because it is far enough removed in time that the probable cause is no longer fresh in anyone’s mind.

You can get other problems from importing star, but in my opinion this is the worst one. The only protection is to not do it at all.

So what do you do instead? After all, when someone imports star, they’re not trying to ruin your life. Normally they do it because the module has many functions and classes they need to use.

There are several strategies. If you only need to import a few components, just import each by name. Just like we have been doing:

from greputils import grepfile

# And then later in your code:
grepfile("pattern to match", "/path/to/file.txt")

This has the advantage of being completely precise and specific. But it is inconvenient when you are importing more than a few items.

An alternative: simply import the module itself. This is just like importing star, except everything is namespaced inside the module. So you get none of the name conflict problems. Like this:

import greputils

# And then later in your code:
greputils.grepfile("pattern to match", "/path/to/file.txt")

The downside is that you have to type the module name over and over. To alleviate this, Python lets you abbreviate the module name when you import it, with an "as" clause. For example, we can rename greputils to gu inside your code:

import greputils as gu

# And then later in your code:
gu.grepfile("pattern to match", "/path/to/file.txt")

It is like creating a more convenient alias for the module, for use just inside your program. This is commonly used in different Python library ecosystems:

# Data scientists are smart enough to use this a lot.
import numpy as np
import pandas as pd

Then in your code, you can refer to np.array, pd.DataFrame, and so on.

This renaming trick works not just with modules, but also with items imported from a module. So if you are calling a function with a long name over and over, you can give it a shorter or better name:

# Import a function and rename it:
from greputils import igrepfile as cigrep

# Similar to:
from greputils import igrepfile
cigrep = igrepfile

This is not just useful for making function names shorter. It also can improve clarity. For example, the tremendously useful dateutil library provides a function that can automagically parse just about any date-time string you give it, returning a nice datetime.datetime instance:

>>> from dateutil.parser import parse
>>> parse('Sat Aug 10 08:03:50 2074')
datetime.datetime(2074, 8, 10, 8, 3, 50)
>>> parse('01-05-2081 11:39pm')
datetime.datetime(2081, 1, 5, 23, 39)
>>> parse('5/15/57 22:29')
datetime.datetime(2057, 5, 15, 22, 29)

That is extremely useful. But parse() is almost the most generic name you can give to a function. It just does not give enough of a clue what the function actually does. You might as well call it compute().

So what I do is rename the function when I import it:

from dateutil.parser import parse as parse_datetime

Then my code can invoke it with the much more informative name of parse_datetime():

>>> parse_datetime('Sat Aug 10 08:03:50 2074')
datetime.datetime(2074, 8, 10, 8, 3, 50)
>>> parse_datetime('01-05-2081 11:39pm')
datetime.datetime(2081, 1, 5, 23, 39)
>>> parse_datetime('5/15/57 22:29')
datetime.datetime(2057, 5, 15, 22, 29)

While renaming on import can be useful, it is also possible to overdo it, to the point the code becomes harder to read. As a general rule, I recommend you use this feature to improve readability, and otherwise simply use the original name.

Import Side Effects

Normally, the files of your module will contain definitions of classes and functions, and perhaps assign top-level variables. They might also import from sub-modules which follow the same pattern.

Modules constructed this way have no observable side effects during the import itself. But that does not mean no code is executed. In fact, executing the module's code is exactly what happens every time you import it.

Few people understand this important point. It relates to what we discussed near the start of this chapter, when we learned about the concept of a “main guard”. Imagine you have a module with this code (using ... as a placeholder for code I am omitting here):

CSV_FIELDS = [
    'date',
    'revenue',
    ...
]

class MissingContent(Exception):
    ...

def extract_params(text):
    ...

class UserData:
    def __init__(self, first_name, last_name, email):
       ...

This module provides several components:

  • A list of strings, called CSV_FIELDS

  • An exception called MissingContent

  • A function called extract_params()

  • A class called UserData

All of these are objects. Yes, even the classes and the function, because in Python everything is an object. How were these objects created? They were created because Python executes the lines of code which define them.

To put it another way, consider these lines:

class UserData:
    def __init__(self, first_name, last_name, email):
       ...

This is a class statement. Python will execute this class statement. Executing this statement will create a new object, called UserData, which is a class. This will be inserted into the current scope, so lines of code after this class statement can create instances of the class. Lines of code prior to this statement cannot, because the class statement which created the class object has not been executed yet.
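You can watch this ordering requirement in action. Here is a small demo (Point is a made-up class for illustration):

```python
# Referring to the class before its statement executes fails:
try:
    Point(1, 2)
except NameError as exc:
    print("before the class statement:", exc)

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

# After the class statement has executed, this works fine:
p = Point(1, 2)
print("after the class statement:", p.x, p.y)
```

The first attempt raises NameError, because the class object simply does not exist until Python executes the class statement.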

Everything said applies to function definitions, too:

def extract_params(text):
    ...

This def statement is Python code which is executed. The effect of that execution is to create a function object, named extract_params(). After that statement is executed, other Python code can invoke that function.

It is important to understand there is no special difference between class and function definitions, versus “normal” code. In order to create the class or function object, Python must execute the class or def statement. At this level, it is exactly the same as a statement like x = 1, as far as Python is concerned.
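Because def is an ordinary statement, you can even run it conditionally, picking one definition or another at runtime. A quick sketch (the function name describe_build is invented for this demo):

```python
import sys

# def executes like any other statement, so it can sit inside an if.
# Here we pick a definition based on whether this is a 64-bit build:
if sys.maxsize > 2**32:
    def describe_build():
        return "64-bit build"
else:
    def describe_build():
        return "32-bit build"

print(describe_build())
```

Only one of the two def statements executes, so only one function object is ever created.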

So when Python imports a module, it must execute that module’s code first. But this is “all or nothing”. When you import a class, Python does not scan through the code to find just that class statement, and only execute that. It executes the whole module.2 That is why we need to use main guards (i.e., the if __name__ == "__main__" trick).
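You can watch this whole-module execution happen, and see that it happens only once, by generating a throwaway module on the fly (the module name noisy and its contents are invented for this demo):

```python
import os
import sys
import tempfile

# Write a tiny module whose top-level code announces itself:
moddir = tempfile.mkdtemp()
with open(os.path.join(moddir, "noisy.py"), "w") as handle:
    handle.write('print("noisy: executing top-level code")\nVALUE = 42\n')

sys.path.insert(0, moddir)

import noisy   # executes the whole module, printing the message
import noisy   # prints nothing: Python caches modules in sys.modules

print(noisy.VALUE)
```

The message appears exactly once. The second import is essentially free, because Python records every imported module in sys.modules and reuses it.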

Some modules exploit this to do import-time code execution on purpose. This is typically used for some kind of initialization or side effect that will take place when that specific module is first imported. For example, imagine a module which creates a database connection object at the top-level:

# database.py

# Function to create a new network connection to the DB
def initiate_database_connection():
    ...

# Go ahead and create a connection handle
# This will be a top-level object,
# importable from the module.
conn = initiate_database_connection()

This is sometimes useful or necessary. But I suggest you avoid import-time side effects, unless you have a compelling reason.

For one, these import-time side effects are often surprising for people who reuse your code. That can include you, months down the road after you’ve written the module, and have completely forgotten that importing it will trigger some kind of microservice connection, for example. Another way of saying this: import-time side effects violate the Principle of Least Surprise.

Another problem: you cannot control the exact timing. Python will execute that module’s code at some point, but exactly when is not written down anywhere: it happens whenever the first import of that module is reached, which may be deep inside some other module’s import chain. And that order can change without warning as you evolve the code, or when you upgrade to a new version of Python. Imagine a program where lines could change order each time you run it; import-time side effects are a bit like that.

Finally, they are inflexible. You cannot avoid executing that code if you do not want to; your only option is to not use the module at all. You generally cannot customize its behavior, because the import itself triggers the execution with hard-coded arguments.

The database example above is guilty of all these crimes. It’s a good example of what not to do.

That said, creating import-time side effects is sometimes necessary—or at least useful enough that it is worth the downsides. If you encounter that situation, do not worry about it; go ahead and do it. But always ask if you can organize your application to avoid it.
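One common way to organize around it is to defer the work until first use. Here is a sketch of database.py reworked that way (initiate_database_connection is a stand-in here, returning a dummy object so the example runs):

```python
# database.py, reworked: importing this module does nothing.
# The connection is created lazily, on the first call.

def initiate_database_connection():
    # Stand-in for real connection setup, for illustration only.
    return {"connected": True}

_conn = None

def get_connection():
    """Create the shared connection on first call, then reuse it."""
    global _conn
    if _conn is None:
        _conn = initiate_database_connection()
    return _conn

# Both calls return the very same object; it was created only once.
print(get_connection() is get_connection())
```

Now importing the module is side-effect free, callers control when the connection happens, and a test suite can substitute its own connection logic before anything touches the network.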

Conclusion

Python’s module system is so nicely designed that you can go far with it just by understanding a few basics. And having read this chapter, you know there is a lot more depth. That simple interface over complex and powerful semantics makes it possible to amplify the impact of the code you write. You create functions and classes which solve hard problems that people care about, package them in a nice module interface, and suddenly you have empowered a lot of folks to do more with less effort. It is a form of leverage. And whether you are distributing an open source project, creating a module for your team, or even just packaging your code to make it easier for you to reuse in the future, this investment of your effort always comes back to you in positive ways.

1 A component is a general term for some modular unit of software. A simple component can be a function, a class, or an instance of a class. Components can also be larger-scale structures, such as whole libraries or services. But in this chapter, “component” just means “something you can import from a Python module”.

2 Could Python be designed to simply execute the relevant class statement only? Not really. Python is so dynamic as a language, that a class which is defined early in the file can be renamed by a line near the end of that file. The only way to ensure the correct class is imported is to execute the whole module.