Programs that solve word puzzles typically read one or more files, and may also read interactive input. One input file that is always required is an English language spelling dictionary. (Other inputs would describe the specific problem to be solved, unless all of this information has been hard-coded into the program). The following spelling dictionaries should contain no abbreviations, acronyms, hyphenated words, slang/informal words, colloquial spellings, or words that begin with a capital letter. (The one exception is the pronoun I, which appears in lower-case). They should include both American and British spellings. If any exceptions to these rules are found, or to have a word added, please notify me.
For purposes of deciding whether a word is slang/informal or colloquial, I have used dictionary.com, www.thefreedictionary.com, and merriam-webster.com. If at least two of these three sources list a word as slang, I will generally exclude it.
Over a period of about 12 years, I constructed a 72,000-word spelling dictionary by hand. Eventually, I combined it with several other spelling dictionaries that I found on the Web to create a much larger spelling dictionary. I am gradually weeding out slang words, but a significant number remain.
A smaller spelling dictionary is sometimes preferrable to a larger one:
Some algorithms—in particular, those for which running time is proportional to the size of the dictionary—may require too much running time when operating on a large spelling dictionary.
Some algorithms will produce a plethora of solutions when one uses a
large spelling dictionary. (See the discussion of the multi-word anagram
Children have more enjoyment from word games and puzzles when most of the words are known to them and other words are easily explainable.
small.txt is a small, hand-generated spelling dictionary that
has been designed to be reasonably kid-safe but otherwise contain most of the
non-slang, non-technical words that one might expect in the
vocabulary of a well-educated 16- to 18-year-old (and a few that one
probably wouldn't expect). I've decided—admittedly somewhat
arbitrarily—to limit the size of this dictionary to 15,200 words. (This
is roughly 8.4 percent of the number of words in the large dictionary). The
small dictionary is a subset of the large one, i.e., any word that appears in
the small dictionary also appears in the large one.
Short words— especially those having six letters or fewer—tend to take precedence over longer words. Among words having more than six letters, I have favored (a) "concept words", (b) action words, and (c) words that describe people, things, and actions over (d) words that name things. Thus, the word evolutionary appears, while the word nucleotide does not.
To request a correction, send me e-mail.
To create space for other words judged to be more valuable, many words that
are neither slang nor acronyms have been excluded from
and placed in the file
overflow.txt instead. These words have more
than six letters, are almost always forms of words appearing in the small
spelling dictionary, and include mainly words of the following types:
adverbs ending with the suffix -ly,
nouns ending with the suffix -ness,
regular noun plurals, and
verb forms ending with the suffixes -s, -ed, and -ing.
many -ing words that can function as adjectives, such as appealing,
many -ing words that can function as nouns, such as serving and setting,
a few plural nouns such as entrails that are used almost exclusively in the plural form, and
a few words, e.g., dastardly, that have been retained in the
small dictionary because they are solutions to Word Match
puzzles. (Word Match puzzle solutions come exclusively from the file
Although I do not have a middle-sized spelling dictionary, a spelling
dictionary containing over 24,300 words can be created by simply uniting
overflow.txt. These two files are
guaranteed to be disjoint, i.e., any word that appears in one does not appear in
the other. Furthermore, words in these two files are a subset of those in the
large spelling dictionary.
When solving many types of word puzzles (e.g., anagrams or
word ladders) via
software, one must be able to determine whether a given sequence of letters is a
valid word. In such puzzle solvers, the efficiency of this test is critical
because it will be performed a very large number of times. The naive approach is
to store the spelling dictionary in a
list data structure, but this
leads to gross inefficiency. If one is checking a sequence of letters that
(a) is not a valid word or (b) is a valid word appearing near the end of the
dictionary, the time required to perform the test will be proportional to
n, the number of words in the dictionary. Computer scientists say that
such an algorithm has performance that is O(n).
The ideal data structure for this application is Python's
data structure. Because the
set is implemented using a hash table, the average
time to perform a lookup (of a valid or invalid) word is essentially constant,
i.e., independent of n. Computer scientists say that such an algorithm
has performance that is O(1).
When n is large, the difference between O(n) and
O(1) can be very significant. The following input/output is
iPython session in which I stored the large spelling
dictionary in two data structures—a
list and a
set—and measured the time to look up the word
zeitgeist in each. (I picked this word because it appears near the end
of the dictionary). Results show that for a spelling dictionary of this
size—181,000 words—runtime performance is improved by a factor of
roughly 44,000 by using a
set rather than a
In : with open('large.txt', 'r') as FILE: ...: words= [ word.rstrip() for word in FILE ] In : timeit 'zeitgeist' in words 100 loops, best of 3: 4.06 ms per loop In : words= set(words) In : timeit 'zeitgeist' in words 10000000 loops, best of 3: 91.7 ns per loop
When working with large word lists, it is often useful to be able to perform various types of automated comparisons and tests. The following Python scripts operate on plain text files containing exactly one word per line.
The Python program
difference.py compares the specified two
files and reports all words that appear in the first file but not in the
second. If, for example, you have a word list called
and want to check that those words are valid, you would issue the following
python difference.py mywords.txt large.txt
large.txt is the large spelling dictionary. (For this
command to work, you must have a working installation of Python on your
computer, and both input files must reside in the working folder).
The Python program
intersection.py compares the specified two
files and reports all words appearing in the both files.
The Python program
duplicates.py analyses a single input file
and reports all words that appear more than once.
2 August, 2013: I've added a section on Python utility programs to this page.
25 July, 2013: Over the last two months, with the goal of improving the educational value of the small dictionary for high-school-age students, I removed over 4,000 words, replacing them with an equal number of vocabulary words of the level that one might expect on such examinations as the SAT or GRE. (Words that were removed from the small dictionary have been stored in the overflow file; they were forms of other words that have remained in the small dictionary).
Last update: 19 June, 2017