Solving Word Puzzles via Regular Expression Matching

Dr. Phillip M. Feldman


1. Introduction

Regular expressions, known as regex's for short, allow one to compactly describe a wide range of text patterns. The Python computer language's standard library includes a module called re that efficiently performs various operations involving regex's, e.g., finding substrings that match a given regex. Several types of word puzzles can be solved by formulating a suitable regex and then testing it against each word in a spelling dictionary, or by using regex's in combination with other techniques. The following types of word puzzles are easily solved using regex's alone:

  1. word puzzles that call for finding a word or all words subject to restrictions on what letters may appear in specified positions within the word,

  2. word puzzles that involve restrictions on what letters may follow or precede other letters or sequences of letters, and

  3. combinations of 1. and 2.

I've written a Python program called regex.py that tests every word in the specified dictionary against a user-specified regex and reports all words that match the regex. This program is available for download, and one can also run it online (see below).

It is worth noting that there are many types of word puzzles that cannot be solved using regex's. Some of these can be solving using algorithms from the standard repertoire of the computer scientist, e.g., letter frequency puzzles are solved via depth-first search. Some word puzzles require sophisticated, special-purpose algorithms.


2. Examples and Exercises for the Reader

2.1 Regex example #1

[mn][aeiou]t$

The above regex describes a text pattern in which the first letter is m or n, the second letter is a vowel, and the third and last letter is t. Each pair of square brackets defines a character class. We could have enclosed the t in square brackets as well, but the brackets are unnecessary for a character class containing only a single letter. The dollar sign at the end of the regex "anchors" the pattern to the end of the string; in plain English, this means that additional letters may not appear after the t.

When the above regex is applied to words in the large spelling dictionary, 8 matches are produced: 'mat', 'met', 'mot', 'mut', 'net', 'nit', 'not', and 'nut'. Note that the dollar sign at the end of the regex is critical; if one omits it, one obtains 1,441 matches instead of 8.

2.2 Regex example #2

   .*iai
The dot (.) is a metacharacter that matches any single character. For our purposes, the dot is equivalent to the character class [a-z], which includes all letters. The asterisk (*) indicates that the immediately preceding regex element appears zero or more times. So, the .* combination represents an arbitrary sequence of letters. The absence of a dollar sign at the end of the regex indicates that the pattern may be followed by an arbitrary sequence of letters.

Note that there is a certain lack of symmetry in the regex syntax: To indicate that a pattern may be preceded by an arbitrary sequence of letters, one must begin the regex with .* (a dot followed by an asterisk); to indicate that the pattern may be followed by an arbitrary sequence of letters, one omits the dollar sign that would otherwise end the regex.

When the above regex is applied to words in the small spelling dictionary, there are no matches. The large spelling dictionary produces 16 matches: 'antiair', 'antiaircraft', 'antiaircrafts', 'campigliaite', 'gravegliaite', 'gugiaite', 'liaise', 'liaised', 'liaises', 'liaising', 'liaison', 'liaisons', 'reppiaite', 'rosiaite', 'tuperssuatsiaite', and 'versiliaite'. Note that 7 of these 16 words are names of minerals.

2.3 Regex example #3

   (f|p|qu|s|sp)(ea|ee|oo)(l|n|t)$

The above regex matches words that can be decomposed into three parts:

  1. The word must begin with f, p, qu, s, or sp.

  2. The middle part of the word must be ea, ee or oo.

  3. The word must end with l, n, or t.

This regex produces 17 matches with the small spelling dictionary: 'feat', 'feel', 'feet', 'fool', 'foot', 'peal', 'peat', 'peel', 'pool', 'queen', 'seal', 'seat', 'seen', 'soon', 'soot', 'spool', and 'spoon'

2.4 Regex example #4

One can search for words that contain ei not immediately preceded by c via the following regex:

   .*(?<!c)ei

The special expression (?<!...), where the three dots represent an arbitrary subpattern, matches if the current position in the string is not preceded by a match to the subpattern. This is called negative lookbehind.

A variation on the above regex limits the number of letters appearing before the ei to at most two and places the same restriction on the number of letters appearing after:

   .{0,2}(?<!c)ei.{0,2}$

The above regex produces nine matches with the small dictionary: 'beige', 'being', 'eyeing', 'reign', 'seeing', 'seize', 'their', 'weigh', and 'weird'

2.5 Exercise #1 for the Reader

Explain why the following two regular expressions produce different results:

   .*(?<!c)ei
   .*[^c]ei

2.6 Exercise #2 for the Reader

Construct a regex that is similar to the second one in Exercise #1 (does not use negative lookbehind), but that produces the same results as the first regex.

2.7 Regex example #5

   ([aeiou][^aeiou]){7,}[aeiou]?$

This moderately complex regex contains an expression in parentheses consisting of two character classes. The first, [aeiou], is the class of all vowels. Note that the second, [^aeiou], begins with a circumflex (^). This indicates that the character class is the complement of the class containing the specified characters, i.e., the class of consonants. The parentheses indicate that the sequence of these two character classes should be treated as a group that will match a vowel immediately followed by a consonant. The expression {7,} immediately following the group indicates that 7 or more consecutive instances of this group must be matched. The regex ends with [aeiou]?, which indicates an optional additional vowel, followed by a dollar sign anchoring the pattern to the end of the string. The entire regex describes a text pattern consisting of alternating vowels and consonants, with a minimum of 14 letters in the word.

This regex yields seven matches with the large dictionary: 'acetazolamides', 'aluminoceladonite', 'aluminosilicate', 'aluminosilicates', 'overelaborated', 'overelaborates', and 'overimaginative'. Note that the first four matches are terms from chemistry. The last three might be a subtle hint to the author of this document.

2.8 Exercise #3 for the Reader

Construct a regex that will match all words such that (a) the sequence ei appears at least twice and (b) the letters l, m, and n do not appear at all. Hint: If you have found the correct regex, it will match exactly three words in the large spelling dictionary.

3. Python Source Code and Online Solver

Download regex.py

To run regex.py online, enter a regular expression into the box, select one of the two spelling dictionaries, and then click on the button below the box.



Small spelling dictionary Large spelling dictionary



4. Solutions to Exercises for the Reader

  1. One can easily verify that the following two regular expressions produce different results:

       .*(?<!c)ei
       .*[^c]ei
    

    To understand why, note that the second regex requires that the ei be preceded by at least one letter (the .* can match anything or nothing, but the [^c] must match a single character). Thus, the first regex will match a word such as either that begins with ei while the second will not. Words produced by the second regex are a subset of those produced by the first.

  2. The solution is to make the portion of the pattern appearing before the ei optional by enclosing it in parentheses and placing a question mark immediately after the closing parenthesis:

    (.*[^c])?ei
    
  3. To exclude the letters l, m, and n, one must use a character class. The following regex uses this character class three times:

    [^lmn]*ei[^lmn]*ei[^lmn]*$
    

Last update: 8 April, 2017