Peter Norvig has published a paper called “English Letter Frequency Counts: Mayzner Revisited” in which he uses the Google English language corpus to update the results of Mark Mayzner’s 1960s research into word and letter frequencies. The title of this post is the second subtitle of the paper and gives the order in which letters most often appear in English. According to Norvig, “Note there is a standard order of frequency used by typesetters, ETAOIN SHRDLU, that is slightly violated here: L, R, and C have all moved up one rank, giving us the less mnemonic ETAOIN SRHLDCU.”
In a comment, Norvig notes that “The major difference [between his work and Mayzner’s] is that when he [Mayzner] reports a count of 0, you can’t tell if that means 1 in 100 thousand, or 1 in 100 billion. With my results, you can differentiate these cases pretty well. The ngrams with high counts (like the top 50 bigrams) remain fairly consistent.”
Apparently to correct for bad OCR (although he doesn’t say so explicitly), Norvig filtered the data: “I discarded any entry that used a character other than the 26 letters A-Z. I also discarded any word with fewer than 100,000 mentions.” A few non-English words seem to have slipped in anyway, but this is a good rule of thumb to remember when using the Google English language corpus.
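For anyone who wants to try the same rule of thumb, here is a minimal Python sketch of that kind of filter. It is not Norvig’s code. The function name `clean_counts`, the input format (a dict mapping each word to its number of mentions), the case folding, and the choice to apply the threshold after merging case variants are all my own assumptions; only the two rules themselves (letters A-Z only, at least 100,000 mentions) come from the quote above.

```python
import re
from collections import Counter

# Only the 26 letters A-Z; treating upper and lower case alike is an assumption.
ONLY_LETTERS = re.compile(r"[A-Za-z]+")
# Threshold taken from the quote above.
MIN_MENTIONS = 100_000


def clean_counts(raw_counts, min_mentions=MIN_MENTIONS):
    """Return {WORD: mentions} with non-letter entries and rare words removed.

    raw_counts is a hypothetical {word: mentions} mapping.
    """
    merged = Counter()
    for word, count in raw_counts.items():
        # Discard any entry containing a character outside A-Z.
        if not ONLY_LETTERS.fullmatch(word):
            continue
        # Assumption: merge case variants before applying the threshold.
        merged[word.upper()] += count
    # Discard words with fewer than min_mentions mentions.
    return {w: c for w, c in merged.items() if c >= min_mentions}
```

Calling `clean_counts` on a mapping that contains entries such as "x1" or "café" drops them regardless of how often they occur, which is the point of the first rule: OCR noise and foreign words tend to carry characters outside A-Z.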
There are a few things that I’d love to have the time to dig into a bit more. For example, see this chart, showing the frequency at which letters occur at particular positions in words:
Looking at first letters, “x”, “q”, and “z” have non-zero probabilities of occurring in that position, as one would expect. The relative frequencies are not quite what I’d expect, though, and a deeper dive into the data would be interesting. Is this a bad OCR effect that the data cleaning didn’t eliminate? Or is something else going on?
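A deeper dive could start with something like the following Python sketch, which computes the share of each letter at a given position, weighted by how often each word is mentioned. This is only a guess at a reasonable way to do it, not a reconstruction of how Norvig built his chart. The function name `letter_shares_at` and the `{word: mentions}` input format are hypothetical, and positions are counted from 0.

```python
from collections import Counter


def letter_shares_at(word_counts, position):
    """Share of each letter at a 0-based position, weighted by mentions.

    word_counts is a hypothetical {word: mentions} mapping. Words too
    short to have a letter at the requested position are skipped.
    """
    totals = Counter()
    for word, count in word_counts.items():
        if len(word) > position:
            totals[word[position].lower()] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {letter: n / grand_total for letter, n in totals.items()}
```

Running this with `position=0` on cleaned counts gives the first-letter shares, and comparing the values for “x”, “q”, and “z” with and without the 100,000-mention threshold would be one way to see whether rare, possibly misread, words are driving the numbers.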
All in all, an interesting paper, and a fun exercise. For the librarians out there, Mayzner’s original research paper was published as:
Mayzner, M. S., & Tresselt, M. E. “Tables of single-letter and digram frequency counts for various word-length and letter-position combinations.” Psychonomic Monograph Supplements. 1. (1965): 13–32.