Programs that solve word puzzles typically read one or more files, and may also read interactive input. One input file that is always required is an English language spelling dictionary. (Other inputs would describe the specific problem to be solved, unless all of this information has been hard-coded into the program). The following spelling dictionaries should contain no abbreviations, acronyms, hyphenated words, slang/informal words, colloquial spellings, or words that begin with a capital letter. (The one exception is the pronoun I, which appears in lower-case). They should include both American and British spellings. If any exceptions to these rules are found, or to have a word added, please notify me.
For purposes of deciding whether a word is slang/informal or colloquial, I have used dictionary.com, www.thefreedictionary.com, and merriam-webster.com. If at least two of these three sources list a word as slang, I will generally exclude it.
Over a period of about 12 years, I constructed a 72,000-word spelling dictionary by hand. Eventually, I combined it with several other spelling dictionaries that I found on the Web to create a much larger spelling dictionary. I am gradually weeding out slang words, but a significant number remain.
A smaller spelling dictionary is sometimes preferrable to a larger one:
Some algorithms—in particular, those for which running time is proportional to the size of the dictionary—may require too much running time when operating on a large spelling dictionary.
Some algorithms will produce a plethora of solutions when one uses a
large spelling dictionary. (See the discussion of the multi-word anagram
solver anagrams3.py
).
Children have more enjoyment from word games and puzzles when most of the words are known to them and other words are easily explainable.
small.txt
is a small, hand-generated spelling dictionary that has been designed
to contain most of the non-slang, non-technical words that one might expect in the vocabulary of a
well-educated 18-year-old (and a few that one probably wouldn't expect). I've decided
—admittedly somewhat arbitrarily—to limit the size of this dictionary to under
16,000 words. The small dictionary is a subset of the large one, i.e., any word that appears
in the small dictionary also appears in the large one.
Short words— especially those having six letters or fewer—tend to take precedence over longer words. Among words having more than six letters, I have favored (a) "concept words", (b) action words, and (c) words that describe people, things, and actions over (d) words that name things. Thus, the word evolutionary appears, while the word nucleotide does not.
To request a correction, send me e-mail.
To create space for other words judged to be more valuable, many words that are neither slang
nor acronyms have been excluded from small.txt
and placed in the file
overflow.txt
instead. These words have more than six letters, are almost always
forms of words appearing in the small spelling dictionary, and include mainly words of the
following types:
adverbs ending with the suffix -ly,
nouns ending with the suffix -ness,
regular noun plurals, and
verb forms ending with the suffixes -s, -ed, and -ing.
many -ing words that can function as adjectives, such as appealing,
many -ing words that can function as nouns, such as serving and setting,
a few plural nouns such as entrails that are used almost exclusively in the plural form, and
a few words, e.g., dastardly, that have been retained in the small dictionary
because they are solutions to Word
Match puzzles. (Word Match puzzle solutions come exclusively from the file
small.txt
).
Although I do not have a middle-sized spelling dictionary, a spelling dictionary containing
over 25,400 words can be created by simply uniting small.txt
and
overflow.txt
. These two files are guaranteed to be disjoint, i.e., any word that
appears in one does not appear in the other. Furthermore, words in these two files are a subset
of those in the large spelling dictionary.
When solving many types of word puzzles (e.g., anagrams or word ladders) via software, one must
be able to determine whether a given sequence of letters is a valid word. In such puzzle
solvers, the efficiency of this test is critical because it will be performed a very large
number of times. The naive approach is to store the spelling dictionary in a list
data structure, but this leads to gross inefficiency. If one is checking a sequence of letters
that (a) is not a valid word or (b) is a valid word appearing near the end of the dictionary,
the time required to perform the test will be proportional to n, the number of words in
the dictionary. Computer scientists say that such an algorithm has performance that is
O(n).
The ideal data structure for this application is Python's set
data structure.
Because the set
is implemented using a hash table, the average time to perform a
lookup (of a valid or invalid) word is essentially constant, i.e., independent of n.
Computer scientists say that such an algorithm has performance that is O(1).
When n is large, the difference between O(n) and
O(1) can be very significant. The following input/output is
from an iPython
session in which I stored the large spelling
dictionary in two data structures—a list
and a
set
—and measured the time to look up the word
zeitgeist in each. (I picked this word because it appears near the end
of the dictionary). Results show that for a spelling dictionary of this
size—181,000 words—runtime performance is improved by a factor of
roughly 44,000 by using a set
rather than a list
!
In [1]: with open('large.txt', 'r') as FILE: ...: words= [ word.rstrip() for word in FILE ] In [2]: timeit 'zeitgeist' in words 100 loops, best of 3: 4.06 ms per loop In [3]: words= set(words) In [4]: timeit 'zeitgeist' in words 10000000 loops, best of 3: 91.7 ns per loop
When working with large word lists, it is often useful to be able to perform various types of automated comparisons and tests. The following Python scripts operate on plain text files containing exactly one word per line.
The Python program difference.py
compares the specified two files and reports
all words that appear in the first file but not in the second. If, for example, you have a word
list called mywords.txt
and want to check that those words are valid, you would
issue the following command:
python difference.py mywords.txt large.txt
where large.txt
is the large spelling dictionary. (For this command to work, you
must have a working installation of Python on your computer, and both input files must reside in
the working folder).
The Python program intersection.py
compares the specified two files and reports
all words appearing in the both files.
The Python program duplicates.py
analyses a single input file and reports all
words that appear more than once.
2 August, 2013: I've added a section on Python utility programs to this page.
25 July, 2013: Over the last two months, with the goal of improving the educational value of the small dictionary for high-school-age students, I removed over 4,000 words, replacing them with an equal number of vocabulary words of the level that one might expect on such examinations as the SAT or GRE. (Words that were removed from the small dictionary have been stored in the overflow file; they were forms of other words that have remained in the small dictionary).
Last update: 28 Sep., 2024