Thursday, October 20, 2005

While writing a previous entry I noticed how often the term "fortunately" turns up in my posts. Perhaps I've had many fortunate experiences, or perhaps I've been telling lots of tales involving possible misfortune in which the worst possibilities did not come to pass. Or maybe I just like the word "fortunately". Anyway, since I've been doing some simplistic work analyzing corpora of texts, I thought I'd turn these analyses on my own blog entries and see what other atypical patterns of word choice are present in my writings (up to and including my last entry). I am focusing here strictly upon word frequency: What uncommon words do I use especially frequently? What common words do I use less frequently than would be expected? And what do I write about the most, just in terms of the content words I recycle again and again?

For the sake of simplicity I am using a somewhat out-of-date word frequency database (Kucera & Francis, 1967. Information on the corpus can be found here); this was once the accepted source of word frequency information (approximately 1,000,000 words from 500 different sources), although much larger corpora have since supplanted it (for example, the British National Corpus is based on 100 million words). To give you an idea of the distribution, here are a few of the most common words in the K&F corpus and how often each one occurred:

THE 69971
OF 36411
AND 28852
TO 26149
A 23237
IN 21341
THAT 10595
IS 10099
WAS 9816
HE 9543

I combined all the text of my blog entries (including titles, picture captions, and the text of hyperlinks, but not including dates, category labels or comments) and calculated how often each word occurred (a handy online tool for doing this can be found here). I discarded all words that occurred fewer than five times, and obtained K&F frequency values for each of the remaining words (a handy tool to do this and more can be found here). My ten most frequently used words were quite similar to the K&F set (above):

THE 3218
A 1663
OF 1646
TO 1477
AND 1242
IN 994
I 942
IS 602
FOR 478
IT 470
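
For anyone who wants to try the counting step at home, here is a minimal Python sketch. It is not the online tool mentioned above: the function name `word_counts`, the tokenizing rule (lowercased runs of letters, with internal apostrophes so that contractions stay whole) and the toy sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def word_counts(text, min_count=5):
    """Count word forms case-insensitively; keep those seen at least min_count times."""
    # Assumed tokenizing rule: a word is a run of letters, optionally
    # with internal apostrophes (so "i'll" stays one word).
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    counts = Counter(words)
    return {word: n for word, n in counts.items() if n >= min_count}

# Toy text, with a lower threshold so that something survives the cut.
sample = "the pub and the ale and the bike"
top = sorted(word_counts(sample, min_count=2).items(), key=lambda kv: -kv[1])
print(top)  # [('the', 3), ('and', 2)]
```

A different tokenizing rule (for instance, one that splits contractions) would give somewhat different counts, so the numbers in this entry should not be expected to reproduce exactly.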

The two lists follow generally similar patterns, although I am clearly talking about myself more than the K&F sources do ("I" is the 7th most popular word in my writing, and 20th most common in the K&F corpus), and less about other men ("HE" is #10 in K&F, but barely squeaks into the top 50 in my list).

When it comes to "fortunately" (and words like it), unfortunately I neglected to consider an important aspect of the K&F frequency database: it seems that certain kinds of derived or inflected forms were counted under their stem rather than as a specific wordform. So "fortunately" (which I have used 40 times) never occurs in the K&F database. Nonetheless, a list of my most frequently used words that never occur in the database is still somewhat informative about my usage tendencies. Among those that are missing for this reason are (in decreasing order of frequency):

especially (50)
seems (50)
fortunately (40)
words (33)
times (31)
folks (27)
things (25)
minutes (23)
probably (23)
definitely (22)

So it's not just "fortunately" but quite a few other similar adverbs that characterize my writing. Some other terms that I use frequently but don't appear in the database are contractions (I'll, 51; that's, 32; I'd, 31; there's, 21) or abbreviations (ABV, 40; UK, 33; OED, 23). Once all of the above are excluded we are left with the terms that I definitely produce more frequently than the database would predict:

dunce (61) (no surprise there)

bike (39) (I am quite bike-obsessed, and perhaps this abbreviation for "bicycle" is more popular now than in the mid-60s? It's been around since the 1880s, though.)

blog (30) (a very new term: OED's earliest citation is 1999, although the source "weblog" is seen as far back (!) as 1993.)

google (24) (rarely used except in cricket until 1996)

Tallinn (19) (I guess there was not so much mention of Soviet cities in the [American] texts that made up the K&F corpus).

website (14) (another new one; OED's first citation ("WEB site") is from 1993)

spam (14) (The product made of pork shoulder and ham certainly existed in the sixties, but this dirty little secret was brushed under the rug as far as the frequency corpus goes. Spam as a verb dates back only to 1991, again according to OED [which, however, does not mention the Monty Python origin].)
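
The "used by me, absent from the database" lists above boil down to a simple filter over two word-count tables. A minimal sketch, assuming both tables are plain Python dicts keyed by lowercase word; the function name `missing_from_reference` is hypothetical, and the toy dicts reuse only counts quoted in this entry.

```python
def missing_from_reference(my_counts, reference_counts):
    """Words present in my_counts but absent from reference_counts, most frequent first."""
    missing = [(word, n) for word, n in my_counts.items()
               if word not in reference_counts]
    # Sort by descending count, breaking ties alphabetically.
    return sorted(missing, key=lambda pair: (-pair[1], pair[0]))

# Tiny stand-ins for the two tables.
mine = {"the": 3218, "dunce": 61, "fortunately": 40}
reference = {"the": 69971}
print(missing_from_reference(mine, reference))
```

Sorting the leftovers into derived forms, contractions, abbreviations and genuinely new words still has to be done by eye; the filter only finds the candidates.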


So there are a few (but not many) quite predictable terms that I use more often than the corpus would predict. Now how about the other direction? I selected the 200 most frequent words in the K&F database and checked which (if any) I used fewer than five times. There were four such words (wept, 507; united, 482; government, 417; knew, 395). "Wept" and "knew" are irritating because these are clearly derived from "weep" and "know" (why do these appear in the database, but "especially", "seems" and "fortunately" do not? Probably because they're irregular, but still...). I don't use the word "weep" in regular conversation unless I'm being dramatic, but am surprised not to have mentioned "knew" given my constant discussions that seem related to knowledge. "United" and "government": my infrequent use of these terms is probably a very good sign that I'm not a political blogger (I get riled up enough writing about traffic, meal times, classifications of nerds and so on).
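
The reverse check can be sketched the same way. This is an illustration rather than the program actually used: `rare_common_words` is a hypothetical name, both tables are assumed to be dicts of raw counts (the personal one taken before anything is discarded), and the toy data uses only figures quoted in this entry plus one made-up count for "knew".

```python
def rare_common_words(my_counts, reference_counts, top_n=200, min_count=5):
    """Among the top_n most frequent reference words, list those I used fewer than min_count times."""
    # Rank reference words from most to least frequent and keep the first top_n.
    top = sorted(reference_counts, key=reference_counts.get, reverse=True)[:top_n]
    # A word I never used counts as zero occurrences.
    return [word for word in top if my_counts.get(word, 0) < min_count]

reference = {"the": 69971, "of": 36411, "government": 417, "knew": 395}
mine = {"the": 3218, "of": 1646, "knew": 2}  # the count for "knew" is invented
print(rare_common_words(mine, reference, top_n=4))
```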

Finally, I looked at all of those words that appear both in the frequency database and my own writing. I did some statistical tricks [1] in order to assess which words occurred unexpectedly often in my writing (relative to what the K&F frequencies would predict), and which words occurred unexpectedly rarely. Here are the results:

My "unexpectedly often" words came from specific topic areas which I must admit I've spent perhaps too much time on: the consumption of alcohol (pub, ale, beer, cider), transportation (zebra, bus, cycle, traffic, destination, commute, London, route), language (noun, etymological, Albanian, verb, slang), and other more specific matters which have drawn my attention (marmalade, Portuguese, quince; slug, bug; badminton). Strangely, there was very little about music ("festival" had a z-score of +1.79, but I've also referred to beer festivals). I should also note here that "toilet" still appears more often in my language than would be expected. I'm still the same little boy who got in trouble on a third grade assignment to write sentences including the words from that week's spelling list. All of my sentences included the word "toilet", and I was therefore given the opportunity to write "toilet" another 500 times. It clearly didn't cure me of it. In general, I also used function words (the, a, an, to, etc.) more often than would be expected from the corpus; perhaps this comes from my (attempted) conversational tone.

When it comes to words I didn't use as often as would be expected, there were a lot of male terms (men, himself, man, "John", Mr., him), and many more terms of the kind you'd expect to see all over your bog-standard political blog (system, social, state, development, program, action, war, court, general, power, against, society, American, freedom, business). Am I intentionally avoiding these hot-button topics? Yeah, I guess so.




[1] Technical note: Frequency data like these are notoriously skewed (a handful of words are extremely common and most are rare), so in order to do this comparison I first transformed frequency by taking the logarithm, then converted the log frequencies into z-scores within each sample (K&F z-score for "the" = 4.16; K&F z-score for a word with frequency 1 = -3.22). I took the difference between the K&F z-score and the z-score derived from my own word frequencies as a measure of how far my usage of a word departs from K&F once the overall shape of each distribution is accounted for.
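
The procedure in the technical note can be sketched in a few lines of Python. Several details are assumptions, because the note does not specify them: the sign convention (here my z-score minus the K&F z-score, so that positive means "unexpectedly often", matching the +1.79 quoted for "festival"), the use of the population standard deviation, and the choice to standardize each table over all of its own words rather than only the shared ones. The function names and toy counts are hypothetical.

```python
import math
from statistics import mean, pstdev

def log_z_scores(counts):
    """Log-transform each count, then standardize within this one sample."""
    logs = {word: math.log(n) for word, n in counts.items()}
    values = list(logs.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumed: population SD; must be non-zero
    return {word: (value - mu) / sigma for word, value in logs.items()}

def z_differences(my_counts, reference_counts):
    """My z-score minus the reference z-score, for words present in both tables."""
    my_z = log_z_scores(my_counts)
    ref_z = log_z_scores(reference_counts)
    shared = my_z.keys() & ref_z.keys()
    return {word: my_z[word] - ref_z[word] for word in shared}

# Invented counts: "toilet" is relatively common in mine, "men" in the reference.
mine = {"the": 100, "toilet": 10, "men": 1}
reference = {"the": 100, "toilet": 1, "men": 10}
diffs = z_differences(mine, reference)
for word in sorted(diffs, key=diffs.get, reverse=True):
    print(word, round(diffs[word], 2))
```

The base of the logarithm does not matter here, since standardizing removes any constant scaling of the log values.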
Thursday, October 20, 2005 12:22:30 PM (GMT Standard Time, UTC+00:00)  |  Comments [3]

Thursday, October 20, 2005 5:12:54 PM (GMT Standard Time, UTC+00:00)
Methinks you have too much time on your hands! I have some walls that need washing....
Big Mama
Friday, October 21, 2005 3:12:55 AM (GMT Standard Time, UTC+00:00)
It seems especially that you probably definitely use certain words more times than other things. Fortunately, all the folks knew within minutes just what you meant.
Friday, October 21, 2005 8:36:20 AM (GMT Standard Time, UTC+00:00)
Well, B.M., I had to do the same kind of analyses on another sample of text for WORK PURPOSES so it was just a matter of running the same programs on my own text. With the excuse that I was making sure things worked properly on a known sample before moving to a larger (important) sample.