Letter Frequency Puzzles

Dr. Phillip M. Feldman

0. Overview

Letter frequencies are counts of the number of times each distinct letter appears in a word or in a sample of text. In the word distinct, for example, the letter frequencies can be expressed using the following compact notation:

   c:1, d:1, i:2, n:1, s:1, t:2

The analysis of letter frequencies played an important role in the early history of cryptanalysis. (Given a sample of text of reasonable size—at least about 300 characters—one can use letter frequency analysis to decode simple cyphers).

The focus of this page is letter frequency puzzles. In such a puzzle, information is provided about the number of times each distinct letter appears in a word; frequencies may be provided for specific letters, unnamed letters, or a combination of the two. The goal is to find a word or all words that satisfy the specified frequencies. The following is an example of a letter frequency puzzle of intermediate difficulty:

Find at least three six-letter words, each containing one c, one d, two o's, one r, and one other letter.

Unlike anagrams, which are quite old, letter frequency puzzles are comparatively recent; their origin is unclear. In many languages other than English, the construction of letter frequency puzzles is problematic or completely impractical. In languages such as French, German, and Swedish, the accent marks create difficulties. The large number of characters (thousands) in Chinese precludes letter frequency puzzles in that language.

1. `frequencies.py`

1.1 Introduction

The Python program frequencies.py finds English words having specified letter frequencies.

1.2 How to Specify Letter Frequencies

The user specifies the desired letter frequencies via a dict (Python dictionary), which is a series of key-value pairs. If you are running the offline version of the code—this version can be downloaded using a link below—this specification may appear on the command line; if it does not, the program prompts the user to enter it interactively.

In each key-value pair, the key must be one of the following:

a single letter, optionally enclosed in quotes. In this case the frequency specified by the value (see below) applies to that specific letter.
an integer, in which case the frequency applies to an unspecified letter that must be distinct from the letters associated with other frequencies.

The value must be one of the following:

a positive integer. This is the number of times that some letter--either the letter specified via the key or some other unique letter if the key is an integer-- appears in a matching word.
a range of integers of the form m-n (two integers separated by a hyphen). This indicates that the corresponding letter must appear at least m times but no more than n times. For example, the key-value pair e:3-99 indicates that the letter e must appear at least 3 and no more than 99 times. The first integer (m) may equal zero. One may use an asterisk (*) as a shorthand for 0-99, i.e., the asterisk indicates that the given letter may appear any number of times (including zero). One may use a question mark (?) as a shorthand for 0-1, and a plus sign (+) as a shorthand for 1-99.

Note: This version of the program requires that when one specifies a frequency as a range, the key must name a specific letter. Thus, for example, the key-value pair 1:* would be illegal. (A preliminary version of a more flexible algorithm has been designed; see below).

1.3 Sample Queries

To find 6-letter words containing one c, one d, two o's, one r, and one other letter, issue the following command:
```
   python frequencies.py c:1, d:1, o:2, r:1, 1:1
```
There are four matches: condor, cordon, corody, and doctor.
To find 10-letter words containing one v, two instances of a second letter, three instances of a third letter, and four instances of a fourth letter, issue the following command:
```
   python frequencies.py v:1, 2:2, 3:3, 4:4
```
There are two matches: evennesses and sleeveless.
To find all words containing three or more e's and any number of m's, n's, p's, and t's, issue the following command:
```
   python frequencies.py e:3-99,m:*,n:*,p:*,t:*
```
There are six matches: entente, epee, pentene, teepee, tenement, and tepee.

Note: When a letter frequency dict contains integer-valued keys, the specific integers must be unique but otherwise do not matter. In the second example above, we would have obtained the same result by specifying the following:

   python frequencies.py v:1, 5:2, 4:3, 7:4

1.4 Download the Offline Version

Download frequencies.py

1.5 Run `frequencies.py` Online

If you do not have a Python interpreter on your computer and don't want to install one, you can run frequencies.py online (i.e., on the server) by entering a query (letter frequency dict) into the box, selecting one of the two spelling dictionaries, and then clicking on the button below the box.

2. `frequencies2.py`

2.1 Introduction

The Python program frequencies2.py—so called because it is the second version of this program—removes two limitations of frequencies.py:

One may specify a range of frequencies for any key, including keys that represent unspecified letters.
A new key—the asterisk (*)—is now supported. This key is used to represent any letters that are not matched by any of the other keys.

2.2 The Algorithm

If you have some background in software, you may be wondering how the code works. frequencies2.py tests each word in the spelling dictionary against the user's frequency specification using the following basic algorithm:

The code tests all frequencies associated with named letters. As each named-letter frequency is satisfied, the associated letters are eliminated from the word under test.
The code performs a recursive depth-first search to test frequencies associated with integer keys—recall that each such key represents a specific unnamed letter—and with the '*' key (if present)— recall that this key represents any letters that were not matched by other keys.
The recursive function accepts two arguments:
- freqs_target_unmatched is a dict containing key-value frequency pairs to be matched, including only integer-valued keys, and
- freqs_word_unmatched is a letter frequency dict representing the unmatched letters in the current word.
A top-level invocation passes all of the integer keys (and associated target frequencies) and unmatched letters in the current word.
If freqs_target_unmatched is an empty dictionary, there are no more frequencies associated with integer keys to be matched. When this happens, the function checks for the presence of the '*' key in the original frequency specification:
- If this key is present, any remaining unmatched letters in the word are tested against the associated frequency. If there is a match, the function terminates the search process for the current word by raising a special exception (see below). Otherwise, the function simply returns control to the caller.
- If this key is not present and unmatched letters remain, this branch of the search tree has not produced a match and the function simply returns control to the caller.
- If this key is not present and no unmatched letters remain, a match has been found, and the function terminates the search process for the current word by raising a special exception (see below).
If at any point in the search process a complete match is found between the frequency specification and all letters of the current word, the code terminates the search process for the current word by raising a special exception. This exception destroys the call stack and informs higher-level code that catches the exception that the search result was successful. (One could make the algorithm work without using an exception to terminate the search process, but running times would be longer).

2.3 Sample Queries

Here is a sample query:

   d:1, r:1, *:1-2

The above query requests all words containing one d, one r, and one or two other letters. If there are two other letters, these may be either distinct or the same. When this query is run against the large spelling dictionary, the code reports 109 matches.

The following somewhat more interesting query finds words in which four distinct letters each occur repeatedly, with up to two other letters appearing in the word:

   1:2-99, 2:2-99, 3:2-99, 4:2-99, *:0-2

When this query is run against the small spelling dictionary, the code reports the following 30 matches:

'arranging', 'attendance', 'classical', 'concession', 'concessions', 'conciseness', 'contractor', 'conviction', 'devastated', 'devastates', 'efficiency', 'embarrasses', 'engineering', 'inefficient', 'insistent', 'instincts', 'institutions', 'loneliness', 'management', 'navigating', 'regretting', 'satellites', 'sentiments', 'signalling', 'statistical', 'structures', 'teammate', 'testifies', 'throughout', and 'tightening'

2.4 Running `frequencies2.py` Online

To run frequencies2.py online, enter a query (letter frequency dict) in the box, select one of the two spelling dictionaries, and then click on the button below the box. Note: Times to run the default query against the small and large spelling dictionaries are roughly 5 and 184 seconds, respectively. When performing queries involving multiple unspecified letters, one should either use the small spelling dictionary or be prepared to be patient. I plan to investigate ways of speeding up frequencies2.py.

2.5 Exercises for the Reader

Formulate a query that will match all words containing a single e, a single r, a single u, and no more than two other letters. If there are two other letters, they must be the same. Hint: The correct query yields 12 matches with the small spelling dictionary. If you are stumped, scroll down to the end of this page for the answer.
Is it possible with a single query to find all words that contain equal numbers of instances of two distinct letters? If not, how would you solve this problem?

3. Words Having Specified Numbers of Vowels

The Python program vowels.py finds all words having a specified length and a specified number of vowels. For purposes of this program, a vowel is one of the following:

'a', 'e', 'i', 'o', or 'u'
'y' occurring in any position other than the first.

Download vowels.py

4. Answers to Exercises

The correct answer is r:1,u:1,e:1,1:0-2. Note that r:1,u:1,e:1,*:0-2 is not correct because some of the five-letter words matched by this query have five distinct letters. Words that match the correct query are a subset of those that match the incorrect one.

frequencies2.py does not permit one to answer this question via a single query. One could write a (very simple) specialized solver for this problem, or solve the problem by configuring frequencies2.py to use the large spelling dictionary and then issuing the following queries (results from each query appear immediately after the query):

python frequencies2.py 1:1,2:1

There are 95 matches: 'ad', 'ae', 'ag', 'ah', 'ai', 'al', 'am', 'an', 'ar',
'as', 'at', 'aw', 'ax', 'ay', 'ba', 'be', 'bi', 'bo', 'by', 'de', 'do', 'du',
'ed', 'ef', 'eh', 'el', 'em', 'en', 'er', 'es', 'et', 'ex', 'fa', 'go', 'ha',
'he', 'hi', 'hm', 'ho', 'id', 'if', 'in', 'is', 'it', 'jo', 'ka', 'la', 'li',
'lo', 'ma', 'me', 'mi', 'mo', 'mu', 'my', 'na', 'ne', 'no', 'nu', 'od', 'oe',
'of', 'oh', 'om', 'on', 'op', 'or', 'os', 'ow', 'ox', 'oy', 'pa', 'pe', 'pi',
're', 'sh', 'si', 'so', 'ta', 'ti', 'to', 'uh', 'uk', 'um', 'un', 'up', 'us',
'ut', 'we', 'wo', 'xi', 'xu', 'ya', 'ye', 'yo'

python frequencies2.py 1:2,2:2

There are 32 matches: 'abba', 'anna', 'baba', 'boob', 'coco', 'dada', 'deed',
'dodo', 'eses', 'gaga', 'gogo', 'haha', 'juju', 'kaka', 'keek', 'lulu', 'mama',
'meme', 'mumu' , 'naan', 'nana', 'nene', 'noon', 'otto', 'papa', 'peep', 'poop',
'sees', 'tete', 'titi', 'toot', 'tutu'

python frequencies2.py 1:3,2:3

There is one match: deeded

Last update: 28 April, 2013