Chapter 8. Strings and Regular Expressions

Strings are not like integers, floats, and booleans. A string is a sequence, which means it contains multiple values in a particular order. In this chapter we’ll see how to access the values that make up a string, and we’ll use functions that process strings.

We’ll also use regular expressions, which are a powerful tool for finding patterns in a string and performing operations like search and replace.

As an exercise, you’ll have a chance to apply these tools to a word game called Wordle.

A String Is a Sequence

A string is a sequence of characters. A character can be a letter (in almost any alphabet), a digit, a punctuation mark, or whitespace.

You can select a character from a string with the bracket operator. This example statement selects character number 1 from fruit and assigns it to letter:

fruit = 'banana'
letter = fruit[1]
       

The expression in brackets is an index, so called because it indicates which character in the sequence to select. But the result might not be what you expect:

letter
       
'a'
       

The letter with index 1 is actually the second letter of the string. An index is an offset from the beginning of the string, so the offset of the first letter is 0:

fruit[0]
       
'b'
       

You can think of 'b' as the 0th letter of 'banana'—pronounced “zero-eth.”

The index in brackets can be a variable:

i = 1
fruit[i]
       
'a'
       

Or an expression that contains variables and operators:

fruit[i+1]
       
'n'
       

But the value of the index has to be an integer—otherwise you get a TypeError:

fruit[1.5]
       
TypeError: string indices must be integers
       

As we saw in Chapter 1, we can use the built-in function len to get the length of a string:

n = len(fruit)
n
       
6
       

To get the last letter of a string, you might be tempted to write this:

fruit[n]
       
IndexError: string index out of range
       

But that causes an IndexError because there is no letter in 'banana' with the index 6. Because we started counting at 0, the six letters are numbered 0 to 5. To get the last character, you have to subtract 1 from n:

fruit[n-1]
       
'a'
       

But there’s an easier way. To get the last letter in a string, you can use a negative index, which counts backward from the end:

fruit[-1]
        
'a'
        

The index -1 selects the last letter, -2 selects the second to last, and so on.
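
As a quick check, here is a small sketch that compares negative and non-negative indices on the same string. It assumes nothing beyond the fruit example above; the loop variable k is only for illustration.

```python
fruit = 'banana'
n = len(fruit)

# For k from 1 to n, the negative index -k and the
# non-negative index n - k select the same character.
for k in range(1, n + 1):
    print(-k, n - k, fruit[-k], fruit[n - k])

# Like fruit[n], an index too far below zero is out of range:
# fruit[-7] would raise an IndexError.
```

So the valid indices for 'banana' run from -6 to 5, and every character can be reached in two ways.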

Writing Files

String operators and methods are useful for reading and writing text files. As an example, we’ll work with the text of Dracula, a novel by Bram Stoker that is available from Project Gutenberg. I’ve downloaded the book in a plain-text file called pg345.txt, which we can open for reading like this:

reader = open('pg345.txt')
        

In addition to the text of the book, this file contains a section at the beginning with information about the book and a section at the end with information about the license. Before we process the text, we can remove this extra material by finding the special lines at the beginning and end that begin with '*** ' (three asterisks and a space).

The following function takes a line and checks whether it is one of the special lines. It uses the startswith method, which checks whether a string starts with a given sequence of characters:

def is_special_line(line):
    return line.startswith('*** ')
        

We can use this function to loop through the lines in the file and print only the special lines:

for line in reader:
    if is_special_line(line):
        print(line.strip())
        
*** START OF THE PROJECT GUTENBERG EBOOK DRACULA ***
*** END OF THE PROJECT GUTENBERG EBOOK DRACULA ***
        

Now let’s create a new file, called pg345_cleaned.txt, that contains only the text of the book. To loop through the book again, we have to open it again for reading. And, to write a new file, we can open it for writing:

reader = open('pg345.txt')
writer = open('pg345_cleaned.txt', 'w')
        

open takes an optional parameter that specifies the “mode”—in this example, 'w' indicates that we’re opening the file for writing. If the file doesn’t exist, it will be created; if it already exists, the contents will be replaced.
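
To see the effect of 'w' mode without touching the book files, here is a minimal sketch that writes to a throwaway file twice. The file name demo.txt and the temporary folder are made up for this illustration, and the sketch uses two methods the text has not shown yet: close, which is explained later in this section, and read, which returns the rest of a file as a single string.

```python
import os
import tempfile

# Work in a temporary folder so no real files are affected.
# The name demo.txt is hypothetical.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'demo.txt')

writer = open(path, 'w')    # the file does not exist yet, so it is created
writer.write('first version\n')
writer.close()

writer = open(path, 'w')    # the file exists now, so its contents are replaced
writer.write('second version\n')
writer.close()

reader = open(path)
contents = reader.read()
reader.close()
print(contents)
```

Only the second line survives, because opening the file for writing a second time discarded what was there.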

As a first step, we’ll loop through the file until we find the first special line:

for line in reader:
    if is_special_line(line):
        break
        

The break statement “breaks” out of the loop—that is, it causes the loop to end immediately, before we get to the end of the file.

When the loop exits, line contains the special line that made the conditional true:

line
        
'*** START OF THE PROJECT GUTENBERG EBOOK DRACULA ***\n'
        

Because reader keeps track of where it is in the file, we can use a second loop to pick up where we left off.

The following loop reads the rest of the file, one line at a time. When it finds the special line that indicates the end of the text, it breaks out of the loop. Otherwise, it writes the line to the output file:

for line in reader:
    if is_special_line(line):
        break
    writer.write(line)
        

When this loop exits, line contains the second special line:

line
        
'*** END OF THE PROJECT GUTENBERG EBOOK DRACULA ***\n'
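
The two-loop pattern can be tried on a miniature example that needs no file on disk. In this sketch, which is an illustration rather than part of the book's workflow, io.StringIO objects stand in for the files returned by open, and the sample text and the names sample_reader and sample_writer are made up.

```python
import io

# A made-up miniature "book" with the same layout as the real file.
sample = (
    'License header\n'
    '*** START OF THE SAMPLE ***\n'
    'First line of the text\n'
    'Second line of the text\n'
    '*** END OF THE SAMPLE ***\n'
    'License footer\n'
)

def is_special_line(line):
    return line.startswith('*** ')

sample_reader = io.StringIO(sample)   # stands in for open('pg345.txt')
sample_writer = io.StringIO()         # stands in for the output file

# First loop: skip everything up to and including the first special line.
for line in sample_reader:
    if is_special_line(line):
        break

# Second loop: picks up on the line after the first special line.
for line in sample_reader:
    if is_special_line(line):
        break
    sample_writer.write(line)

cleaned = sample_writer.getvalue()
print(cleaned)
```

The second loop does not start over, because the reader remembers its position, so only the two lines between the special lines are written.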
        

At this point reader and writer are still open, which means we could keep reading lines from reader or writing lines to writer. To indicate that we’re done, we can close both files by invoking the close method:

reader.close()
writer.close()
        

To check whether this process was successful, we can read the first few lines from the new file we just created:

for line in open('pg345_cleaned.txt'):
    line = line.strip()
    if len(line) > 0:
        print(line)
    if line.endswith('Stoker'):
        break
        
DRACULA
_by_
Bram Stoker
        

The endswith method checks whether a string ends with a given sequence of characters.
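
The call to strip in the loop above matters, because each line read from a file ends with a newline. This small sketch, using a line written out by hand, shows the difference:

```python
line = 'Bram Stoker\n'

starts = line.startswith('Bram')
ends_raw = line.endswith('Stoker')             # the line actually ends with '\n'
ends_clean = line.strip().endswith('Stoker')   # strip removes the newline first

print(starts, ends_raw, ends_clean)
```

Without strip, endswith sees the newline as the last character and reports that the line does not end with 'Stoker'.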

Regular Expressions

If we know exactly what sequence of characters we’re looking for, we can use the in operator to find it and the replace method to replace it. But there is another tool, called a regular expression, that can also perform these operations—and a lot more.

To demonstrate, I’ll start with a simple example and we’ll work our way up. Suppose, again, that we want to find all lines that contain a particular word. For a change, let’s look for references to the titular character of the book, Count Dracula. Here’s a line that mentions him:

text = "I am Dracula; and I bid you welcome, Mr. Harker, to my house."
        

And here’s the pattern we’ll use to search:

pattern = 'Dracula'
        

A module called re provides functions related to regular expressions. We can import it like this and use the search function to check whether the pattern appears in the text:

import re

result = re.search(pattern, text)
result
        
<re.Match object; span=(5, 12), match='Dracula'>
        

If the pattern appears in the text, search returns a Match object that contains the results of the search. Among other information, it has an attribute named string that contains the text that was searched:

result.string
        
'I am Dracula; and I bid you welcome, Mr. Harker, to my house.'
        

It also provides a method called group that returns the part of the text that matched the pattern:

result.group()
        
'Dracula'
        

And it provides a method called span that returns the indices in the text where the match starts and ends:

result.span()
        
(5, 12)
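
To connect the span to the matched text, here is a short sketch. It uses two other Match methods, start and end, which return the two numbers separately, and a slice of the form text[start:end], which selects the characters from start up to but not including end. Neither appears in the example above, so treat this as a supplementary illustration.

```python
import re

text = "I am Dracula; and I bid you welcome, Mr. Harker, to my house."
result = re.search('Dracula', text)

start = result.start()   # index of the first matched character
end = result.end()       # index one past the last matched character

# Slicing the text with these indices recovers what group returns.
matched = text[start:end]
print(start, end, matched)
```

The second number in the span is one past the last matched character, which is why the difference between the two numbers is the length of the match.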
        

If the pattern doesn’t appear in the text, the return value from search is None:

result = re.search('Count', text)
print(result)
        
None
        

So we can check whether the search was successful by checking whether the result is None:

result == None
        
True
        

Putting all that together, here’s a function that loops through the lines in the book until it finds one that matches the given pattern, and returns the Match object:

def find_first(pattern):
    for line in open('pg345_cleaned.txt'):
        result = re.search(pattern, line)
        if result != None:
            return result
        

We can use it to find the first mention of a character:

result = find_first('Harker')
result.string
        
'CHAPTER I. Jonathan Harker’s Journal\n'
        

For this example, we didn’t have to use regular expressions—we could have done the same thing more easily with the in operator. But regular expressions can do things the in operator cannot.

For example, if the pattern includes the vertical bar character, '|', it can match either the sequence on the left or the sequence on the right. Suppose we want to find the first mention of Mina Murray in the book, but we are not sure whether she is referred to by first name or last. We can use the following pattern, which matches either name:

pattern = r'Mina|Murray'
result = find_first(pattern)
result.string
        
'CHAPTER V. Letters—Lucy and Mina\n'
        

We can use a pattern like this to see how often a character is mentioned by either name. Here’s a function that loops through the book and counts the number of lines that match the given pattern:

def count_matches(pattern):
    count = 0
    for line in open('pg345_cleaned.txt'):
        result = re.search(pattern, line)
        if result != None:
            count += 1
    return count
        

Now let’s see how many lines mention Mina by either name:

count_matches('Mina|Murray')
        
229
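
Note that count_matches counts lines, not individual mentions: a line that contains both names, or the same name twice, counts once. The following sketch shows the difference on three made-up lines stored in a list. It uses re.findall, a function in the re module that returns a list of all non-overlapping matches in a string; the variable names are only for illustration.

```python
import re

# Three made-up lines; the first mentions the character by both names.
lines = [
    'Mina Murray wrote to Lucy.\n',
    'Lucy answered the next day.\n',
    'Then Mina wrote again.\n',
]
pattern = 'Mina|Murray'

line_count = 0
occurrence_count = 0
for line in lines:
    if re.search(pattern, line) is not None:
        line_count += 1
    # findall returns a list of every non-overlapping match in the line
    occurrence_count += len(re.findall(pattern, line))

print(line_count, occurrence_count)
```

Here two lines match, but the names appear three times in total.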
        

The special character '^' matches the beginning of a string, so we can find a line that starts with a given pattern:

result = find_first('^Dracula')
result.string
        
'Dracula, jumping to his feet, said:--\n'
        

And the special character '$' matches the end of a string, so we can find a line that ends with a given pattern (ignoring the newline at the end):

result = find_first('Harker$')
result.string
        
"by five o'clock, we must start off; for it won't do to leave Mrs. Harker\n"
        
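
These two special characters can be tried on a single line written out by hand. In this sketch the line is made up, and each search is compared with None so the results are easy to read:

```python
import re

line = 'Harker met Dracula\n'

starts_with_harker = re.search('^Harker', line) is not None
starts_with_dracula = re.search('^Dracula', line) is not None  # present, but not at the start
ends_with_dracula = re.search('Dracula$', line) is not None    # '$' also matches just before a final newline
ends_with_harker = re.search('Harker$', line) is not None      # present, but not at the end

print(starts_with_harker, starts_with_dracula)
print(ends_with_dracula, ends_with_harker)
```

Both names appear in the line, but each pattern matches only when the name is in the required position.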

String Substitution

Bram Stoker was born in Ireland, and when Dracula was published in 1897, he was living in England. So we would expect him to use the British spelling of words like “centre” and “colour.” To check, we can use the following pattern, which matches either “centre” or the American spelling “center.”

pattern = 'cent(er|re)'
        

In this pattern, the parentheses enclose the part of the pattern the vertical bar applies to. So this pattern matches a sequence that starts with 'cent' and ends with either 'er' or 're':

result = find_first(pattern)
result.string
        
'horseshoe of the Carpathians, as if it were the centre of some sort of\n'
        

As expected, he used the British spelling.
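
To see why the parentheses matter, compare the pattern with a version that leaves them out. This is a small sketch on hand-picked words; the names grouped and ungrouped are only for illustration.

```python
import re

grouped = 'cent(er|re)'   # 'cent' followed by either 'er' or 're'
ungrouped = 'center|re'   # either 'center' or just 're'

match_grouped = re.search(grouped, 'centre').group()
match_ungrouped = re.search(ungrouped, 'centre').group()
print(match_grouped, match_ungrouped)

# Without parentheses, the pattern also matches words that have
# nothing to do with "centre", because any 're' is enough.
print(re.search(grouped, 'There'))
print(re.search(ungrouped, 'There').group())
```

Without the parentheses, the vertical bar splits the whole pattern in two, so the right-hand alternative is just 're'.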

We can also check whether he used the British spelling of “colour.” The following pattern uses the special character '?', which means that the previous character is optional:

pattern = 'colou?r'
        

This pattern matches either “colour” with the 'u' or “color” without it:

result = find_first(pattern)
line = result.string
line
        
'undergarment with long double apron, front, and back, of coloured stuff\n'
        

Again, as expected, he used the British spelling.

Now suppose we want to produce an edition of the book with American spellings. We can use the sub function in the re module, which does string substitution:

re.sub(pattern, 'color', line)
        
'undergarment with long double apron, front, and back, of colored stuff\n'
        

The first argument is the pattern we want to find and replace, the second is what we want to replace it with, and the third is the string we want to search. In the result, you can see that “colour” has been replaced with “color.”
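
Two details of sub are worth checking with a small experiment: it replaces every match in the string, not just the first, and it returns a new string rather than changing the original. The sentence in this sketch is made up for the purpose.

```python
import re

# A made-up sentence with two British spellings.
sentence = 'The colour of the sky and the colours of the sea\n'

americanized = re.sub('colou?r', 'color', sentence)
print(americanized)

# The original string is unchanged; sub returned a new one.
print(sentence)
```

Because the pattern matches only the 'colour' part of 'colours', the trailing 's' is left in place and the result is 'colors'.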

Exercises

Ask a Virtual Assistant

In this chapter, we only scratched the surface of what regular expressions can do. To get an idea of what’s possible, ask a virtual assistant, “What are the most common special characters used in Python regular expressions?”

You can also ask for a pattern that matches particular kinds of strings. For example, try asking:

  • “Write a Python regular expression that matches a 10-digit phone number with hyphens.”

  • “Write a Python regular expression that matches a street address with a number and a street name, followed by ST or AVE.”

  • “Write a Python regular expression that matches a full name with any common title like Mr or Mrs, followed by any number of names beginning with capital letters, possibly with hyphens between some names.”

And if you want to see something more complicated, try asking for a regular expression that matches any legal URL.

A regular expression often has the letter r before the quotation mark, which indicates that it is a raw string. For more information, ask a virtual assistant, “What is a raw string in Python?”

Exercise

See if you can write a function that does the same thing as the shell command head. It should take as arguments the name of a file to read, the number of lines to read, and the name of the file to write the lines into. If the third parameter is None, it should display the lines rather than write them to a file.

Consider asking a virtual assistant for help, but if you do, tell it not to use a with statement or a try statement.

Exercise

“Wordle” is an online word game where the objective is to guess a five-letter word in six or fewer attempts. Each attempt has to be recognized as a word, not including proper nouns. After each attempt, you get information about which of the letters you guessed appear in the target word, and which ones are in the correct position.

For example, suppose the target word is MOWER and you guess TRIED. You would learn that E is in the word and in the correct position, R is in the word but not in the correct position, and T, I, and D are not in the word.

As a different example, suppose you have guessed the words SPADE and CLERK, and you’ve learned that E is in the word, but not in either of those positions, and none of the other letters appear in the word.

Of the words in the word list, how many could be the target word? Write a function called check_word that takes a five-letter word and checks whether it could be the target word.

You can use any of the functions from the previous chapter, like uses_any.

Exercise

Continuing the previous exercise, suppose you guess the word TOTEM and learn that the E is still not in the right place, but the M is. How many words are left?

Exercise

The Count of Monte Cristo is a novel by Alexandre Dumas that is considered a classic. Nevertheless, in the introduction of an English translation of the book, the writer Umberto Eco confesses that he found the book to be “one of the most badly written novels of all time.”

Specifically, he says it is “shameless in its repetition of the same adjective,” and he singles out the number of times “its characters either shudder or turn pale.”

To see whether his objection is valid, let’s count the number of times the word pale appears in any form, including pale, pales, paled, and paleness, as well as the related word pallor. Use a single regular expression that matches all of these words and no others.