English Spelling Dictionaries (and How to Use Them in Python)

English Spelling Dictionaries (and How to Use Them in Python)

Dr. Phillip M. Feldman

1. Introduction

Programs that solve word puzzles typically read one or more files, and may also read interactive input. One input file that is always required is an English language spelling dictionary. (Other inputs would describe the specific problem to be solved, unless all of this information has been hard-coded into the program). The following spelling dictionaries should contain no abbreviations, acronyms, hyphenated words, slang/informal words, colloquial spellings, or words that begin with a capital letter. (The one exception is the pronoun I, which appears in lower-case). They should include both American and British spellings. If any exceptions to these rules are found, or to have a word added, please notify me.

For purposes of deciding whether a word is slang/informal or colloquial, I have used dictionary.com, www.thefreedictionary.com, and merriam-webster.com. If at least two of these three sources list a word as slang, I will generally exclude it.

2. A Large Spelling Dictionary

Over a period of about 12 years, I constructed a 72,000-word spelling dictionary by hand. Eventually, I combined it with several other spelling dictionaries that I found on the Web to create a much larger spelling dictionary. I am gradually weeding out slang words, but a significant number remain.

Large spelling dictionary (181,000 words)

3. A Small, Kid-Safe Spelling Dictionary

A smaller spelling dictionary is sometimes preferrable to a larger one:

small.txt is a small, hand-generated spelling dictionary that has been designed to be reasonably kid-safe but otherwise contain most of the non-slang, non-technical words that one might expect in the vocabulary of a well-educated 16- to 18-year-old (and a few that one probably wouldn't expect). I've decided—admittedly somewhat arbitrarily—to limit the size of this dictionary to 15,200 words. (This is roughly 8.4 percent of the number of words in the large dictionary). The small dictionary is a subset of the large one, i.e., any word that appears in the small dictionary also appears in the large one.

Short words— especially those having six letters or fewer—tend to take precedence over longer words. Among words having more than six letters, I have favored (a) "concept words", (b) action words, and (c) words that describe people, things, and actions over (d) words that name things. Thus, the word evolutionary appears, while the word nucleotide does not.

To request a correction, send me e-mail.

Small, kid-safe spelling dictionary (15,200 words)

To create space for other words judged to be more valuable, many words that are neither slang nor acronyms have been excluded from small.txt and placed in the file overflow.txt instead. These words have more than six letters, are almost always forms of words appearing in the small spelling dictionary, and include mainly words of the following types:

Despite the above, the small spelling dictionary includes the following:
Overflow Dictionary (8,900 words)

Although I do not have a middle-sized spelling dictionary, a spelling dictionary containing roughly 24,100 words can be created by simply uniting small.txt and overflow.txt. (These two files are guaranteed to be disjoint, i.e., any word that appears in one does not appear in the other).

4. How to Use the Spelling Dictionaries in Python (or Why Choosing the Right Data Structure Matters)

When solving many types of word puzzles (e.g., anagrams or word ladders) via software, one must be able to determine whether a given sequence of letters is a valid word. In such puzzle solvers, the efficiency of this test is critical because it will be performed a very large number of times. The naive approach is to store the spelling dictionary in a list data structure, but this leads to gross inefficiency. If one is checking a sequence of letters that (a) is not a valid word or (b) is a valid word appearing near the end of the dictionary, the time required to perform the test will be proportional to n, the number of words in the dictionary. Computer scientists say that such an algorithm has performance that is O(n).

The ideal data structure for this application is Python's set data structure. Because the set is implemented using a hash table, the average time to perform a lookup (of a valid or invalid) word is essentially constant, i.e., independent of n. Computer scientists say that such an algorithm has performance that is O(1).

When n is large, the difference between O(n) and O(1) can be very significant. The following input/output is from an iPython session in which I stored the large spelling dictionary in two data structures—a list and a set—and measured the time to look up the word zeitgeist in each. (I picked this word because it appears near the end of the dictionary). Results show that for a spelling dictionary of this size—181,000 words—runtime performance is improved by a factor of roughly 44,000 by using a set rather than a list!

In [1]: with open('large.txt', 'r') as FILE:
   ...:     words= [ word.rstrip() for word in FILE ]

In [2]: timeit 'zeitgeist' in words
100 loops, best of 3: 4.06 ms per loop

In [3]: words= set(words)

In [4]: timeit 'zeitgeist' in words
10000000 loops, best of 3: 91.7 ns per loop

5. Downloadable Utilities

When working with large word lists, it is often useful to be able to perform various types of automated comparisons and tests. The following Python scripts operate on plain text files containing exactly one word per line.

The Python program difference.py compares the specified two files and reports all words that appear in the first file but not in the second. If, for example, you have a word list called mywords.txt and want to check that those words are valid, you would issue the following command:

   python difference.py mywords.txt large.txt

where large.txt is the large spelling dictionary. (For this command to work, you must have a working installation of Python on your computer, and both input files must reside in the working folder).

Download difference.py

The Python program intersection.py compares the specified two files and reports all words appearing in the both files.

Download intersection.py

The Python program duplicates.py analyses a single input file and reports all words that appear more than once.

Download duplicates.py

6. News Items

2 August, 2013: I've added a section on Python utility programs to this page.

25 July, 2013: Over the last two months, with the goal of improving the educational value of the small dictionary for high-school-age students, I removed over 4,000 words, replacing them with an equal number of vocabulary words of the level that one might expect on such examinations as the SAT or GRE. (Words that were removed from the small dictionary have been stored in the overflow file; they were forms of other words that have remained in the small dictionary).

Last update: 12 Mar., 2017