Letter frequencies are counts of the number of times each distinct letter appears in a word or in a sample of text. In the word distinct, for example, the letter frequencies can be expressed using the following compact notation:
c:1, d:1, i:2, n:1, s:1, t:2
The analysis of letter frequencies played an important role in the early history of cryptanalysis. (Given a sample of text of reasonable size—at least about 300 characters—one can use letter frequency analysis to decode simple cyphers).
The focus of this page is letter frequency puzzles. In such a puzzle, information is provided about the number of times each distinct letter appears in a word; frequencies may be provided for specific letters, unnamed letters, or a combination of the two. The goal is to find a word or all words that satisfy the specified frequencies. The following is an example of a letter frequency puzzle of intermediate difficulty:
Find at least three six-letter words, each containing one c, one d, two o's, one r, and one other letter.
Unlike anagrams, which are quite old, letter frequency puzzles are comparatively recent; their origin is unclear. In many languages other than English, the construction of letter frequency puzzles is problematic or completely impractical. In languages such as French, German, and Swedish, the accent marks create difficulties. The large number of characters (thousands) in Chinese precludes letter frequency puzzles in that language.
frequencies.py
The Python program frequencies.py
finds English words having
specified letter frequencies.
dict
(Python dictionary), which is a series of key-value pairs. If you are running
the offline version of the code—this version can be downloaded using a
link below—this specification may appear on the command line; if it does
not, the program prompts the user to enter it interactively.
In each key-value pair, the key must be one of the following:
a single letter, optionally enclosed in quotes. In this case the frequency specified by the value (see below) applies to that specific letter.
an integer, in which case the frequency applies to an unspecified letter that must be distinct from the letters associated with other frequencies.
The value must be one of the following:
a positive integer. This is the number of times that some letter--either the letter specified via the key or some other unique letter if the key is an integer-- appears in a matching word.
a range of integers of the form m-n
(two integers
separated by a hyphen). This indicates that the corresponding letter must
appear at least m
times but no more than n
times.
For example, the key-value pair e:3-99
indicates that the letter
e must appear at least 3 and no more than 99 times. The first integer
(m) may equal zero. One may use an asterisk (*) as a shorthand for 0-99, i.e.,
the asterisk indicates that the given letter may appear any number of times
(including zero). One may use a question mark (?) as a shorthand for 0-1, and
a plus sign (+) as a shorthand for 1-99.
Note: This version of the program requires that when one specifies a
frequency as a range, the key must name a specific letter. Thus, for example,
the key-value pair 1:*
would be illegal. (A preliminary version
of a more flexible algorithm has been designed; see below).
To find 6-letter words containing one c, one d, two o's, one r, and one other letter, issue the following command:
python frequencies.py c:1, d:1, o:2, r:1, 1:1
There are four matches: condor, cordon, corody, and doctor.
To find 10-letter words containing one v, two instances of a second letter, three instances of a third letter, and four instances of a fourth letter, issue the following command:
python frequencies.py v:1, 2:2, 3:3, 4:4
There are two matches: evennesses and sleeveless.
To find all words containing three or more e's and any number of m's, n's, p's, and t's, issue the following command:
python frequencies.py e:3-99,m:*,n:*,p:*,t:*
There are six matches: entente, epee, pentene, teepee, tenement, and tepee.
Note: When a letter frequency dict
contains integer-valued keys,
the specific integers must be unique but otherwise do not matter. In the second
example above, we would have obtained the same result by specifying the
following:
python frequencies.py v:1, 5:2, 4:3, 7:4
frequencies.py
OnlineIf you do not have a Python interpreter on your computer and don't want to
install one, you can run frequencies.py
online (i.e., on the
server) by entering a query (letter frequency dict
) into the box,
selecting one of the two spelling dictionaries, and then clicking on the
button below the box.
frequencies2.py
The Python program frequencies2.py
—so called because it is
the second version of this program—removes two limitations of
frequencies.py
:
One may specify a range of frequencies for any key, including keys that represent unspecified letters.
A new key—the asterisk (*)—is now supported. This key is used to represent any letters that are not matched by any of the other keys.
If you have some background in software, you may be wondering how the code
works. frequencies2.py
tests each word in the spelling dictionary
against the user's frequency specification using the following basic algorithm:
The code tests all frequencies associated with named letters. As each named-letter frequency is satisfied, the associated letters are eliminated from the word under test.
The code performs a
recursive
depth-first search to test frequencies associated with integer
keys—recall that each such key represents a specific unnamed
letter—and with the '*'
key (if present)— recall that
this key represents any letters that were not matched by other keys.
The recursive function accepts two arguments:
freqs_target_unmatched
is a dict
containing
key-value frequency pairs to be matched, including only integer-valued keys,
and
freqs_word_unmatched
is a letter frequency
dict
representing the unmatched letters in the current word.
A top-level invocation passes all of the integer keys (and associated target frequencies) and unmatched letters in the current word.
If freqs_target_unmatched
is an empty dictionary, there are no
more frequencies associated with integer keys to be matched. When this happens,
the function checks for the presence of the '*'
key in the original
frequency specification:
If this key is present, any remaining unmatched letters in the word are tested against the associated frequency. If there is a match, the function terminates the search process for the current word by raising a special exception (see below). Otherwise, the function simply returns control to the caller.
If this key is not present and unmatched letters remain, this branch of the search tree has not produced a match and the function simply returns control to the caller.
If this key is not present and no unmatched letters remain, a match has been found, and the function terminates the search process for the current word by raising a special exception (see below).
If at any point in the search process a complete match is found between the frequency specification and all letters of the current word, the code terminates the search process for the current word by raising a special exception. This exception destroys the call stack and informs higher-level code that catches the exception that the search result was successful. (One could make the algorithm work without using an exception to terminate the search process, but running times would be longer).
d:1, r:1, *:1-2
The above query requests all words containing one d, one r, and one or two other letters. If there are two other letters, these may be either distinct or the same. When this query is run against the large spelling dictionary, the code reports 109 matches.
The following somewhat more interesting query finds words in which four distinct letters each occur repeatedly, with up to two other letters appearing in the word:
1:2-99, 2:2-99, 3:2-99, 4:2-99, *:0-2
When this query is run against the small spelling dictionary, the code reports the following 30 matches:
'arranging', 'attendance', 'classical', 'concession', 'concessions', 'conciseness', 'contractor', 'conviction', 'devastated', 'devastates', 'efficiency', 'embarrasses', 'engineering', 'inefficient', 'insistent', 'instincts', 'institutions', 'loneliness', 'management', 'navigating', 'regretting', 'satellites', 'sentiments', 'signalling', 'statistical', 'structures', 'teammate', 'testifies', 'throughout', and 'tightening'
frequencies2.py
OnlineTo run frequencies2.py
online, enter a query (letter frequency
dict
) in the box, select one of the two spelling dictionaries, and
then click on the button below the box. Note: Times to run the default query
against the small and large spelling dictionaries are roughly 5 and 184 seconds,
respectively. When performing queries involving multiple unspecified letters,
one should either use the small spelling dictionary or be prepared to be
patient. I plan to investigate ways of speeding up
frequencies2.py
.
Formulate a query that will match all words containing a single e, a single r, a single u, and no more than two other letters. If there are two other letters, they must be the same. Hint: The correct query yields 12 matches with the small spelling dictionary. If you are stumped, scroll down to the end of this page for the answer.
Is it possible with a single query to find all words that contain equal numbers of instances of two distinct letters? If not, how would you solve this problem?
The Python program vowels.py
finds all words having a specified
length and a specified number of vowels. For purposes of this program, a vowel
is one of the following:
The correct answer is r:1,u:1,e:1,1:0-2
. Note that
r:1,u:1,e:1,*:0-2
is not correct because some of the five-letter
words matched by this query have five distinct letters. Words that match the
correct query are a subset of those that match the incorrect one.
frequencies2.py
does not permit one to answer this question
via a single query. One could write a (very simple) specialized solver for this
problem, or solve the problem by configuring frequencies2.py
to use
the large spelling dictionary and then issuing the following queries (results
from each query appear immediately after the query):
python frequencies2.py 1:1,2:1 There are 95 matches: 'ad', 'ae', 'ag', 'ah', 'ai', 'al', 'am', 'an', 'ar', 'as', 'at', 'aw', 'ax', 'ay', 'ba', 'be', 'bi', 'bo', 'by', 'de', 'do', 'du', 'ed', 'ef', 'eh', 'el', 'em', 'en', 'er', 'es', 'et', 'ex', 'fa', 'go', 'ha', 'he', 'hi', 'hm', 'ho', 'id', 'if', 'in', 'is', 'it', 'jo', 'ka', 'la', 'li', 'lo', 'ma', 'me', 'mi', 'mo', 'mu', 'my', 'na', 'ne', 'no', 'nu', 'od', 'oe', 'of', 'oh', 'om', 'on', 'op', 'or', 'os', 'ow', 'ox', 'oy', 'pa', 'pe', 'pi', 're', 'sh', 'si', 'so', 'ta', 'ti', 'to', 'uh', 'uk', 'um', 'un', 'up', 'us', 'ut', 'we', 'wo', 'xi', 'xu', 'ya', 'ye', 'yo' python frequencies2.py 1:2,2:2 There are 32 matches: 'abba', 'anna', 'baba', 'boob', 'coco', 'dada', 'deed', 'dodo', 'eses', 'gaga', 'gogo', 'haha', 'juju', 'kaka', 'keek', 'lulu', 'mama', 'meme', 'mumu' , 'naan', 'nana', 'nene', 'noon', 'otto', 'papa', 'peep', 'poop', 'sees', 'tete', 'titi', 'toot', 'tutu' python frequencies2.py 1:3,2:3 There is one match: deeded
Last update: 28 April, 2013