I before e

Despite the fact that English is an absolutely terrible language and nobody should speak it, people still do. So to cope with the many irregularities and the near impossibility of getting anything right, people come up with catchy rhymes, like this one for “I before e”:

I before e

except after c

or when sounding like A,

as in neighbour or weigh

 

That’s the rhyme I learned in school, and it’s the only one I know of. People sometimes forget the last two lines. The wiki entry mentions a variant that includes a “long e” sound, but since that’s not part of the rhyme I learned, I’m going to ignore it.

I was wondering how accurate the rule really is. I checked the literature (read: spent 5 minutes on Google Scholar and Wikipedia) and it seems that nobody has done a full analysis. Sadly I was not able to do one myself, but I do have some useful info.

Methods

As with all things in life, we start with Wikipedia. Actually, Wiktionary in this case. It has spellings and definitions for about 3 million English words. Checking through all of these words to see that “ei” is preceded by “c”, and that “ie” is NOT, isn’t a big deal. Downloading the wiki dump took longer[1]. No, the hard part is pronunciation.
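The spelling half of that check fits in a few lines. This is only a sketch of the idea, not the code from my gist: the function name is made up for illustration, it only looks at the first “ie” or “ei” in a word, and it returns None for words with neither pair or with both.

```python
def spelling_obeys_rule(word):
    """Check only the spelling part of the rhyme (no pronunciation).

    Returns True or False, or None when the rule cannot be applied.
    Assumptions for this sketch: words containing both "ie" and "ei"
    are skipped, and only the first occurrence of the pair is checked.
    """
    w = word.lower()
    has_ie, has_ei = "ie" in w, "ei" in w
    if has_ie == has_ei:  # neither pair present, or both present
        return None
    pair = "ie" if has_ie else "ei"
    pos = w.find(pair)
    after_c = pos > 0 and w[pos - 1] == "c"
    # "ei" is expected right after a c; "ie" is expected everywhere else.
    return after_c if pair == "ei" else not after_c
```

With this, “receive” and “believe” pass, “science” fails, and “weigh” fails on spelling alone, which is exactly why the pronunciation step is needed.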

In order to make sure that the “ei” or “ie” in question did (or did not) make the long a sound, one has to check the pronunciation. Of the (approximately) 20,000 words which contained either of those letter pairs[2], only around 800 had pronunciation information. I’m assuming that these are the most commonly used words, though, so the results here would be most applicable.

Of the downloaded corpus of words, I found 744 which had IPA pronunciation data. Often multiple dialects are given; I took only the US dialect when more than one was available. It might be interesting to re-run this with other dialects, but mainly I just wanted a consistent set.

IPA uses “eɪ” to represent the long-a sound. My method of figuring this out was to look up a few words and see whether those characters were used; they always seemed to be. IPA is very extensive, so there may be other combinations which would be close enough, and my analysis would incorrectly classify those words as failing the rule.

I used a very simple method to figure out which part of the pronunciation key was represented by each letter in the word. This is necessary because the “eɪ” sound could exist in the word but not be created by “ei” or “ie” (e.g. piemaker, although since that didn’t have explicit pronunciation info it wasn’t part of the sample set). Just figure out the ratio of letters used in the pronunciation to letters used in the spelling, find the location of “ei” or “ie” in the word, and go to that location multiplied by the ratio in the pronunciation.

For instance, “weight” has 6 letters, but its IPA pronunciation “weɪt” has 4. So each written letter represents 2/3 of a pronunciation letter. “ei” starts in position 1 (I always index from 0) in the word, so we start at position 2/3 in the pronunciation, which rounds down to position 0, in this case “w”. Extend by 1 letter at the beginning and end for safety, and the full section we check is “weɪt”. Obviously this will work really well for short and/or single-syllable words. Most words seem to be close in length to their pronunciations, so I would expect this to be fairly accurate on average.
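Here is a minimal sketch of that alignment, written to reproduce the “weight” example. It is my reading of the method described here rather than a copy of the gist: the function names, the integer rounding, and treating the end position as inclusive are all assumptions.

```python
LONG_A = "eɪ"  # IPA sequence treated as the long-a sound in this post


def pronunciation_window(word, ipa, pair, pad=1):
    """Return the slice of `ipa` that roughly lines up with `pair` in `word`."""
    pos = word.find(pair)
    if pos < 0:
        raise ValueError(f"{pair!r} not found in {word!r}")
    # Scale spelling positions by len(ipa) / len(word), rounding down.
    start = pos * len(ipa) // len(word)
    end = (pos + len(pair)) * len(ipa) // len(word)
    # Pad one symbol on each side and clamp to the string; `end` is inclusive.
    return ipa[max(0, start - pad):min(len(ipa), end + pad + 1)]


def sounds_like_a(word, ipa, pair):
    """True if the long-a sequence falls inside the aligned window."""
    return LONG_A in pronunciation_window(word, ipa, pair)
```

Calling `pronunciation_window("weight", "weɪt", "ei")` gives back “weɪt”, so `sounds_like_a` is true for it. Integer arithmetic is used instead of floats so the rounding is predictable.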

Results

Of the 744 words, 522 obeyed the full rule and 222 violated it, for a success rate of about 70%. Not terrible, but not great either. My method has some flaws: “their” has a very strange pronunciation, and one could argue it should fit the rule, but my program counted it as a failure. “Neighbour” and “weigh” were classified correctly, thank FSM. Since this was just a sample, a basic 1/sqrt(sample size) estimate would indicate an expected error of about 3.7%. That’s on top of errors which might happen in processing; unfortunately I don’t have an estimate for those[3]. Glancing through the list I can see some incorrect classifications, so this is by no means perfect.
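The arithmetic is easy to reproduce. The counts below come straight from the paragraph above; the function name is just for illustration, and 1/sqrt(n) is the same rough rule of thumb, not a proper confidence interval.

```python
import math


def summarize(obeyed, violated):
    """Return (success rate, rough 1/sqrt(n) sampling error) for the counts."""
    total = obeyed + violated
    rate = obeyed / total
    rough_error = 1 / math.sqrt(total)
    return rate, rough_error


rate, rough_error = summarize(522, 222)
print(f"success rate: {rate:.1%}, rough sampling error: {rough_error:.1%}")
```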

So when in doubt, use the rule. But don’t use it to challenge the spelling of any word if a reliable source says otherwise, because 70% is pretty low if any other evidence exists.

Resources

Code: https://gist.github.com/3408990

It’s Python code, but it doesn’t have any requirements beyond the standard library. It needs the wiki data dump in the same directory. That’s about a gig, so I don’t want to mirror it.
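If you just want to poke at the dump without running the whole script, something like the following should stream it without unpacking the full file. This is a Python 3 sketch and not what the gist does: `iter_pages` is a name I made up, and it assumes the usual MediaWiki export layout of `page` elements holding a `title` and a revision `text`.

```python
import bz2
import xml.etree.ElementTree as ET


def iter_pages(dump_path):
    """Yield (title, wikitext) pairs from a bz2-compressed MediaWiki XML dump."""
    with bz2.open(dump_path, "rb") as fh:
        title, text = None, None
        for _event, elem in ET.iterparse(fh, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
            if tag == "title":
                title = elem.text
            elif tag == "text":
                text = elem.text or ""
            elif tag == "page":
                yield title, text
                title, text = None, None
                elem.clear()  # release the page's children once handled
```

Usage would be along the lines of `for title, text in iter_pages("enwiktionary-latest-pages-articles.xml.bz2"): ...`, filtering titles for “ie” or “ei” before looking at the text.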

Processed word list: ie_result_summary

[1] http://dumps.wikimedia.org/enwiktionary/latest/ Retrieved about Aug. 12, 2012. File: enwiktionary-latest-pages-articles.xml.bz2
[2] I excluded words which had both, to make processing easier and more accurate.
[3] I’d have to manually check 50-100, and frankly I don’t feel like it.
