Letter Frequency Puzzles

Dr. Phillip M. Feldman


0. Overview

Letter frequencies are counts of the number of times each distinct letter appears in a word or in a sample of text. In the word distinct, for example, the letter frequencies can be expressed using the following compact notation:

   c:1, d:1, i:2, n:1, s:1, t:2

The analysis of letter frequencies played an important role in the early history of cryptanalysis. (Given a sample of text of reasonable size—at least about 300 characters—one can use letter frequency analysis to decode simple cyphers).

The focus of this page is letter frequency puzzles. In such a puzzle, information is provided about the number of times each distinct letter appears in a word; frequencies may be provided for specific letters, unnamed letters, or a combination of the two. The goal is to find a word or all words that satisfy the specified frequencies. The following is an example of a letter frequency puzzle of intermediate difficulty:

Find at least three six-letter words, each containing one c, one d, two o's, one r, and one other letter.

Unlike anagrams, which are quite old, letter frequency puzzles are comparatively recent; their origin is unclear. In many languages other than English, the construction of letter frequency puzzles is problematic or completely impractical. In languages such as French, German, and Swedish, the accent marks create difficulties. The large number of characters (thousands) in Chinese precludes letter frequency puzzles in that language.

1. frequencies.py

1.1 Introduction

The Python program frequencies.py finds English words having specified letter frequencies.

1.2 How to Specify Letter Frequencies

The user specifies the desired letter frequencies via a dict (Python dictionary), which is a series of key-value pairs. If you are running the offline version of the code—this version can be downloaded using a link below—this specification may appear on the command line; if it does not, the program prompts the user to enter it interactively.

In each key-value pair, the key must be one of the following:

The value must be one of the following:

Note: This version of the program requires that when one specifies a frequency as a range, the key must name a specific letter. Thus, for example, the key-value pair 1:* would be illegal. (A preliminary version of a more flexible algorithm has been designed; see below).

1.3 Sample Queries

  1. To find 6-letter words containing one c, one d, two o's, one r, and one other letter, issue the following command:

       python frequencies.py c:1, d:1, o:2, r:1, 1:1
    

    There are four matches: condor, cordon, corody, and doctor.

  2. To find 10-letter words containing one v, two instances of a second letter, three instances of a third letter, and four instances of a fourth letter, issue the following command:

       python frequencies.py v:1, 2:2, 3:3, 4:4
    

    There are two matches: evennesses and sleeveless.

  3. To find all words containing three or more e's and any number of m's, n's, p's, and t's, issue the following command:

       python frequencies.py e:3-99,m:*,n:*,p:*,t:*
    

    There are six matches: entente, epee, pentene, teepee, tenement, and tepee.

Note: When a letter frequency dict contains integer-valued keys, the specific integers must be unique but otherwise do not matter. In the second example above, we would have obtained the same result by specifying the following:

   python frequencies.py v:1, 5:2, 4:3, 7:4

1.4 Download the Offline Version

Download frequencies.py

1.5 Run frequencies.py Online

If you do not have a Python interpreter on your computer and don't want to install one, you can run frequencies.py online (i.e., on the server) by entering a query (letter frequency dict) into the box, selecting one of the two spelling dictionaries, and then clicking on the button below the box.



Small spelling dictionary Large spelling dictionary


2. frequencies2.py

2.1 Introduction

The Python program frequencies2.py—so called because it is the second version of this program—removes two limitations of frequencies.py:

2.2 The Algorithm

If you have some background in software, you may be wondering how the code works. frequencies2.py tests each word in the spelling dictionary against the user's frequency specification using the following basic algorithm:

  1. The code tests all frequencies associated with named letters. As each named-letter frequency is satisfied, the associated letters are eliminated from the word under test.

  2. The code performs a recursive depth-first search to test frequencies associated with integer keys—recall that each such key represents a specific unnamed letter—and with the '*' key (if present)— recall that this key represents any letters that were not matched by other keys.

    The recursive function accepts two arguments:

    A top-level invocation passes all of the integer keys (and associated target frequencies) and unmatched letters in the current word.

    If freqs_target_unmatched is an empty dictionary, there are no more frequencies associated with integer keys to be matched. When this happens, the function checks for the presence of the '*' key in the original frequency specification:

    If at any point in the search process a complete match is found between the frequency specification and all letters of the current word, the code terminates the search process for the current word by raising a special exception. This exception destroys the call stack and informs higher-level code that catches the exception that the search result was successful. (One could make the algorithm work without using an exception to terminate the search process, but running times would be longer).

2.3 Sample Queries

Here is a sample query:

   d:1, r:1, *:1-2

The above query requests all words containing one d, one r, and one or two other letters. If there are two other letters, these may be either distinct or the same. When this query is run against the large spelling dictionary, the code reports 109 matches.

The following somewhat more interesting query finds words in which four distinct letters each occur repeatedly, with up to two other letters appearing in the word:

   1:2-99, 2:2-99, 3:2-99, 4:2-99, *:0-2

When this query is run against the small spelling dictionary, the code reports the following 30 matches:

'arranging', 'attendance', 'classical', 'concession', 'concessions', 'conciseness', 'contractor', 'conviction', 'devastated', 'devastates', 'efficiency', 'embarrasses', 'engineering', 'inefficient', 'insistent', 'instincts', 'institutions', 'loneliness', 'management', 'navigating', 'regretting', 'satellites', 'sentiments', 'signalling', 'statistical', 'structures', 'teammate', 'testifies', 'throughout', and 'tightening'

2.4 Running frequencies2.py Online

To run frequencies2.py online, enter a query (letter frequency dict) in the box, select one of the two spelling dictionaries, and then click on the button below the box. Note: Times to run the default query against the small and large spelling dictionaries are roughly 5 and 184 seconds, respectively. When performing queries involving multiple unspecified letters, one should either use the small spelling dictionary or be prepared to be patient. I plan to investigate ways of speeding up frequencies2.py.



Small spelling dictionary Large spelling dictionary


2.5 Exercises for the Reader

  1. Formulate a query that will match all words containing a single e, a single r, a single u, and no more than two other letters. If there are two other letters, they must be the same. Hint: The correct query yields 12 matches with the small spelling dictionary. If you are stumped, scroll down to the end of this page for the answer.

  2. Is it possible with a single query to find all words that contain equal numbers of instances of two distinct letters? If not, how would you solve this problem?



3. Words Having Specified Numbers of Vowels

The Python program vowels.py finds all words having a specified length and a specified number of vowels. For purposes of this program, a vowel is one of the following:

Download vowels.py


4. Answers to Exercises

  1. The correct answer is r:1,u:1,e:1,1:0-2. Note that r:1,u:1,e:1,*:0-2 is not correct because some of the five-letter words matched by this query have five distinct letters. Words that match the correct query are a subset of those that match the incorrect one.

  2. frequencies2.py does not permit one to answer this question via a single query. One could write a (very simple) specialized solver for this problem, or solve the problem by configuring frequencies2.py to use the large spelling dictionary and then issuing the following queries (results from each query appear immediately after the query):

    python frequencies2.py 1:1,2:1
    
    There are 95 matches: 'ad', 'ae', 'ag', 'ah', 'ai', 'al', 'am', 'an', 'ar',
    'as', 'at', 'aw', 'ax', 'ay', 'ba', 'be', 'bi', 'bo', 'by', 'de', 'do', 'du',
    'ed', 'ef', 'eh', 'el', 'em', 'en', 'er', 'es', 'et', 'ex', 'fa', 'go', 'ha',
    'he', 'hi', 'hm', 'ho', 'id', 'if', 'in', 'is', 'it', 'jo', 'ka', 'la', 'li',
    'lo', 'ma', 'me', 'mi', 'mo', 'mu', 'my', 'na', 'ne', 'no', 'nu', 'od', 'oe',
    'of', 'oh', 'om', 'on', 'op', 'or', 'os', 'ow', 'ox', 'oy', 'pa', 'pe', 'pi',
    're', 'sh', 'si', 'so', 'ta', 'ti', 'to', 'uh', 'uk', 'um', 'un', 'up', 'us',
    'ut', 'we', 'wo', 'xi', 'xu', 'ya', 'ye', 'yo'
    
    python frequencies2.py 1:2,2:2
    
    There are 32 matches: 'abba', 'anna', 'baba', 'boob', 'coco', 'dada', 'deed',
    'dodo', 'eses', 'gaga', 'gogo', 'haha', 'juju', 'kaka', 'keek', 'lulu', 'mama',
    'meme', 'mumu' , 'naan', 'nana', 'nene', 'noon', 'otto', 'papa', 'peep', 'poop',
    'sees', 'tete', 'titi', 'toot', 'tutu'
    
    python frequencies2.py 1:3,2:3
    
    There is one match: deeded
    

Last update: 28 April, 2013