Common Word Prefixes

Dr. Phillip M. Feldman

Certain word prefixes are fairly common. Knowledge of word prefixes is useful when playing the children's word game "Ghost" because it is advantageous, at least early in the game, to choose letters such that there are many words that begin with the current sequence of letters.

Is any initial sequence of letters in a word a prefix? This depends on one's point of view. When playing "Ghost", it is reasonable to view any initial sequence of letters as a prefix as long as one can make a longer word that begins with that sequence of letters. Thus, in is a prefix of increase. A grammarian, however, would say that in is not a prefix of increase because crease and increase are etymologically unrelated words.

Writing a Python script to identify common word prefixes is a good exercise for someone who is starting to learn Python. The reader is encouraged to try to write a Python script before clicking on the buttons to look at my code; you can compare your output against my output below to determine whether your program is working.

The program prefixes1.py identifies the most common 3-letter prefixes according to the definition of prefix that a Ghost player would use.

prefixes1.py

Output from this program appears below. The third value in each line is the number of words beginning with the given prefix. Note that many of these prefixes, e.g., tra and ove, are not prefixes in the grammatical sense.


 1: ('con', 1992) 

 2: ('pre', 1756) 

 3: ('dis', 1741) 

 4: ('ove', 1711) 

 5: ('non', 1635) 

 6: ('pro', 1477) 

 7: ('int', 1443) 

 8: ('out', 1307) 

 9: ('mis', 1251) 

10: ('sub', 1065) 

11: ('ant', 1057) 

12: ('com', 969)  

13: ('tra', 930)  

14: ('par', 861)  

15: ('sup', 854)  

16: ('res', 846)  

17: ('per', 830)  

18: ('rec', 828)  

19: ('car', 779)  

20: ('for', 739)  

21: ('cha', 721)  

22: ('imp', 690)  

23: ('tri', 686)  

24: ('str', 672)  

25: ('und', 665)  

26: ('cou', 636)  

27: ('rep', 625)  

28: ('hyp', 623)  

29: ('sta', 611)  

30: ('dec', 605)

The program prefixes2.py attempts to identify 2- and 3-letter prefixes that satisfy the grammarian's definition. (One might think that this would require two passes through the list of words, but it does not. The code uses a Python set data structure to be able to rapidly determine whether stripping initial letters leaves a result that is a valid English word). Note that the code cannot do a perfect job of satisfying the grammarian's definition of prefix because the spelling dictionary provides no information about the etymological connections between words.

prefixes2.py

Partial output from prefixes2.py—the 30 most-commonly occurring 3-letter prefixes—appears below.


 1: ('non', 1572  

 2: ('out', 1259  

 3: ('pre', 1140  

 4: ('dis', 1109  

 5: ('mis', 1103  

 6: ('sub', 746)  

 7: ('con', 432)  

 8: ('pro', 347)  

 9: ('com', 240)  

10: ('tri', 229)  

11: ('bio', 211)  

12: ('per', 173)  

13: ('res', 165)  

14: ('par', 161)  

15: ('for', 137)  

16: ('red', 132)  

17: ('air', 130)  

18: ('rep', 125)  

19: ('iso', 121)  

20: ('car', 119)  

21: ('sun', 114)  

22: ('epi', 113)  

23: ('sea', 111)  

24: ('uns', 110)  

25: ('geo', 109)  

26: ('mal', 105)  

27: ('bar', 104)  

28: ('mid', 103)  

29: ('dia', 98)   

30: ('sur', 98)

Note that the outputs from prefixes1.py and prefixes2.py are quite different, e.g., the prefix ove appears near the top of the first list but not at all in the second.

Last update: 1 Mar, 2013