Regular
expressions, known as regex's for short, allow one to compactly
describe a wide range of text patterns. The
Python computer language's standard library includes a module called re
that
efficiently performs various operations involving regex's, e.g., finding
substrings that match a given regex. Several types of word puzzles can be
solved by formulating a suitable regex and then testing it against each word in
a spelling dictionary, or by using regex's in combination with other techniques.
The following types of word puzzles are easily solved using regex's alone:
word puzzles that call for finding a word or all words subject to restrictions on what letters may appear in specified positions within the word,
word puzzles that involve restrictions on what letters may follow or precede other letters or sequences of letters, and
combinations of 1. and 2.
I've written a Python program called regex.py
that tests every
word in the specified dictionary against a user-specified regex and reports all
words that match the regex. This program is available for download, and one can
also run it online (see below).
It is worth noting that there are many types of word puzzles that cannot be solved using regex's. Some of these can be solving using algorithms from the standard repertoire of the computer scientist, e.g., letter frequency puzzles are solved via depth-first search. Some word puzzles require sophisticated, special-purpose algorithms.
[mn][aeiou]t$
The above regex describes a text pattern in which the first letter is m or n, the second letter is a vowel, and the third and last letter is t. Each pair of square brackets defines a character class. We could have enclosed the t in square brackets as well, but the brackets are unnecessary for a character class containing only a single letter. The dollar sign at the end of the regex "anchors" the pattern to the end of the string; in plain English, this means that additional letters may not appear after the t.
When the above regex is applied to words in the large spelling dictionary, 8 matches are produced: 'mat', 'met', 'mot', 'mut', 'net', 'nit', 'not', and 'nut'. Note that the dollar sign at the end of the regex is critical; if one omits it, one obtains 1,441 matches instead of 8.
.*iaiThe dot (.) is a metacharacter that matches any single character. For our purposes, the dot is equivalent to the character class
[a-z]
,
which includes all letters. The asterisk (*) indicates that the immediately
preceding regex element appears zero or more times. So, the .*
combination represents an arbitrary sequence of letters. The absence of a
dollar sign at the end of the regex indicates that the pattern may be followed
by an arbitrary sequence of letters.
Note that there is a certain lack of symmetry in the regex syntax: To
indicate that a pattern may be preceded by an arbitrary sequence of letters, one
must begin the regex with .*
(a dot followed by an asterisk); to
indicate that the pattern may be followed by an arbitrary sequence of letters,
one omits the dollar sign that would otherwise end the regex.
When the above regex is applied to words in the small spelling dictionary, there are no matches. The large spelling dictionary produces 16 matches: 'antiair', 'antiaircraft', 'antiaircrafts', 'campigliaite', 'gravegliaite', 'gugiaite', 'liaise', 'liaised', 'liaises', 'liaising', 'liaison', 'liaisons', 'reppiaite', 'rosiaite', 'tuperssuatsiaite', and 'versiliaite'. Note that 7 of these 16 words are names of minerals.
(f|p|qu|s|sp)(ea|ee|oo)(l|n|t)$
The above regex matches words that can be decomposed into three parts:
The word must begin with f, p, qu, s, or sp.
The middle part of the word must be ea, ee or oo.
The word must end with l, n, or t.
This regex produces 17 matches with the small spelling dictionary: 'feat', 'feel', 'feet', 'fool', 'foot', 'peal', 'peat', 'peel', 'pool', 'queen', 'seal', 'seat', 'seen', 'soon', 'soot', 'spool', and 'spoon'
One can search for words that contain ei not immediately preceded by c via the following regex:
.*(?<!c)ei
The special expression (?<!...)
, where the three dots
represent an arbitrary subpattern, matches if the current position in the string
is not preceded by a match to the subpattern. This is called negative
lookbehind.
A variation on the above regex limits the number of letters appearing before the ei to at most two and places the same restriction on the number of letters appearing after:
.{0,2}(?<!c)ei.{0,2}$
The above regex produces nine matches with the small dictionary: 'beige', 'being', 'eyeing', 'reign', 'seeing', 'seize', 'their', 'weigh', and 'weird'
Explain why the following two regular expressions produce different results:
.*(?<!c)ei .*[^c]ei
Construct a regex that is similar to the second one in Exercise #1 (does not use negative lookbehind), but that produces the same results as the first regex.
([aeiou][^aeiou]){7,}[aeiou]?$
This moderately complex regex contains an expression in parentheses
consisting of two character classes. The first, [aeiou]
, is the
class of all vowels. Note that the second, [^aeiou]
, begins with
a circumflex (^). This indicates that the character class is the complement of
the class containing the specified characters, i.e., the class of consonants.
The parentheses indicate that the sequence of these two character classes
should be treated as a group that will match a vowel immediately followed by a
consonant. The expression {7,}
immediately following the group
indicates that 7 or more consecutive instances of this group must be matched.
The regex ends with [aeiou]?
, which indicates an optional
additional vowel, followed by a dollar sign anchoring the pattern to the end of
the string. The entire regex describes a text pattern consisting of alternating
vowels and consonants, with a minimum of 14 letters in the word.
This regex yields seven matches with the large dictionary: 'acetazolamides', 'aluminoceladonite', 'aluminosilicate', 'aluminosilicates', 'overelaborated', 'overelaborates', and 'overimaginative'. Note that the first four matches are terms from chemistry. The last three might be a subtle hint to the author of this document.
Construct a regex that will match all words such that (a) the sequence
ei appears at least twice and (b) the letters l, m, and
n do not appear at all. Hint: If you have found the correct regex, it
will match exactly three words in the large spelling dictionary.
To run regex.py
online, enter a regular expression into the box,
select one of the two spelling dictionaries, and then click on the button below
the box.
One can easily verify that the following two regular expressions produce different results:
.*(?<!c)ei .*[^c]ei
To understand why, note that the second regex requires that the ei be
preceded by at least one letter (the .*
can match anything or
nothing, but the [^c]
must match a single character). Thus, the
first regex will match a word such as either that begins with ei
while the second will not. Words produced by the second regex are a subset of
those produced by the first.
The solution is to make the portion of the pattern appearing before the
ei
optional by enclosing it in parentheses and placing a question
mark immediately after the closing parenthesis:
(.*[^c])?ei
To exclude the letters l, m, and n, one must use a character class. The following regex uses this character class three times:
[^lmn]*ei[^lmn]*ei[^lmn]*$
Last update: 8 April, 2017