A little over two years ago, while I was still an undergraduate student at Bangor University, David Crystal came around to give a talk based on his book By Hook or by Crook: A Journey in Search of English. One of the many adventures in language land he talked about was the hunt for isograms: words in which each grapheme occurs the same number of times. For instance, isogram is a first-order isogram (or a 1-isogram), because each letter (i, s, o, g, r, a, m) occurs exactly once; deed is an example of a 2-isogram, since both d and e occur exactly twice. There are also a few examples of 3-isograms, such as deeded or geggee, but David was quite adamant that he did not know of any fourth-order isograms.
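To make the definition concrete, here is a minimal Python sketch of how the order of an isogram could be computed. The function name isogram_order is made up for illustration, and the sketch assumes that each letter counts as one grapheme, that case does not matter, and that hyphens and other non-letters are ignored.

```python
from collections import Counter


def isogram_order(word: str) -> int:
    """Return n if every letter in `word` occurs exactly n times, else 0.

    Assumptions (not from the talk): one letter = one grapheme,
    case is ignored, and non-letters such as hyphens are skipped.
    """
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return 0
    # Collect the distinct occurrence counts; an isogram has exactly one.
    distinct_counts = set(Counter(letters).values())
    return distinct_counts.pop() if len(distinct_counts) == 1 else 0


for word in ["isogram", "deed", "deeded", "geggee", "letter"]:
    print(word, isogram_order(word))
# isogram 1
# deed 2
# deeded 3
# geggee 3
# letter 0
```

A return value of 0 means the word is not an isogram of any order, as with letter, where e and t occur twice but l and r only once.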
Naturally, this piqued my interest. It is hardly a stretch to assume that the order of an isogram is inversely related to its frequency, i.e. 1-isograms will be quite common, 2-isograms somewhat uncommon, 3-isograms rare, and so forth; but a 4-isogram, while probably exceedingly rare, did not immediately strike me as something I would assume not to exist. So I went and googled isograms. A 4-isogram I did not find, but more questions I did.
For one, there is the long-standing question of which isogram is the longest in the English language. Dmitri Borgmann went looking for it and found dermatoglyphics, an attested 15-letter 1-isogram. Another attested example is the 14-letter copyrightables, which can be prefixed to make the 16-letter uncopyrightables. Ross Eckler, in Making the Alphabet Dance, reports the attested subdermatoglyphic, a staggering 17 letters, but still a 1-isogram. So we may wonder (i) whether subdermatoglyphic is the longest attested isogram, and (ii) if 17 letters is the longest 1-isogram, what is the longest 2-isogram, the longest 3-isogram, etc.?
From these followed other questions I have not seen anybody else ask, or address for that matter. What is the distribution of different isograms? Is the order of an isogram not only inversely related to token frequency, but also to use? Are higher-order or very long isograms (such as subdermatoglyphic) more likely to be hapax legomena in any given corpus of text?
Right there at David’s talk I already thought: why not let a computer figure out whether there is a 4-isogram? And, of course, tackle all these other questions too. It has been a while, but last week I finally sat down and wrote a little script to go through different word lists and give me answers to at least some of these questions. So far, I’ve run the script on Google Ngrams and on the British National Corpus (BNC). Noise is a big problem, especially with Google Ngrams, which has many nonsense tokens such as “IIIIIIIIIIIIIIIIIIIII”. These often turn out to be places where OCR tried to scan images, page numbers, tables, indices and such things. The BNC is not unproblematic either, although here XML mishaps in particular seem to be to blame, such as tokens with the string “©” in them. I have thus decided to exclude strings with special characters and numbers in them, bar hyphens. Beyond that, what seems to help is to cross-check instances from Ngrams and the BNC, i.e. to create a list of isograms that have been found in both word lists independently. These seem to be fairly low in noise.
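The filtering and cross-checking steps might look roughly like the following Python sketch. It is not the script itself: the names TOKEN_RE, isogram_order and isograms_in, the restriction to ASCII letters, the case folding and the two tiny sample token lists are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a "clean" token is ASCII letters, optionally joined by single
# hyphens. Digits, punctuation and entity debris are rejected.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")


def isogram_order(word):
    """Return n if every letter occurs exactly n times, else 0."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return 0
    distinct_counts = set(Counter(letters).values())
    return distinct_counts.pop() if len(distinct_counts) == 1 else 0


def isograms_in(tokens, min_order=1):
    """Map each clean, lower-cased isogram in `tokens` to its order."""
    found = {}
    for token in tokens:
        if TOKEN_RE.fullmatch(token):
            order = isogram_order(token)
            if order >= min_order:
                found[token.lower()] = order
    return found


# Made-up stand-ins for the two word lists.
ngrams_tokens = ["deed", "IIIIIIIIIIIIIIIIIIIII", "Nangganangga", "page12", "isogram"]
bnc_tokens = ["deed", "nangganangga", "&copy;right", "letter"]

ngrams_isograms = isograms_in(ngrams_tokens)
bnc_isograms = isograms_in(bnc_tokens)

# Cross-check: keep only isograms attested in both lists independently.
shared = sorted(ngrams_isograms.keys() & bnc_isograms.keys())
print(shared)  # ['deed', 'nangganangga']
```

Note that the run of capital I's passes the character filter and even counts as a (nonsensical) isogram, since it consists of one letter repeated; in this sketch it is only the cross-check against the second list that removes it.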
While I have not yet got around to doing more detailed analyses for all the extra questions I raised above, I am glad to report that I have finally found a 4-isogram, with a caveat: it is a name. Nangganangga is the name of a Fijian spirit who supposedly guards paradise against bachelors. While searching the regular internet for Nangganangga turns up more junk than real results, Google Books illustrates quite well that the name is attested a-plenty in English language writing, though, as can be seen from the chart below, people generally appear to have been much more interested in the spirit-thing in the 19th century than they are now. Well, I’m pretty chuffed with my find.
I hope to find the time over the next few days to look more closely at the data my script generated, and to have at least vague answers to most of the other questions soon. Along with those results, I am also planning to make the script and my data available to other ‘isogrammers’, so maybe there is more to come. I am sure there must be plenty of other interesting questions to ask, beyond the few I could think of, and maybe someone else would even care to run the same thing on other languages for comparison. Surely different spelling systems, together with the phonological properties of a language, must also influence the trends we find in isogramy.