Thursday, June 30, 2005
For work-related purposes I've needed to conduct a variety of text analyses, and thought I'd learn the ropes with some recent publicly available texts. Why not choose recent speeches by politicians, I thought? Of course GW Bush's recent speech at Fort Bragg came to mind first.

In a first pass I simply counted the frequency of each word in his speech, then examined the collocates (i.e. words occurring nearby) of unusually frequent words. Unsurprisingly, the most common words were closed-class (in decreasing order of frequency: the (442 times), and, to, of, in, our, a, is, we, are, that, their, they (76 times)). Most of those are also among the most frequent words in the language as a whole, but the pronouns "our", "we", "their" and "they" are unusually frequent in Bush's speech (respectively 6th, 9th, 12th and 13th most common; in a "standard English corpus" [Kucera and Francis, 1967], those words are 136th, 41st, 40th and 30th). I then looked at the collocates of these terms to see what they co-occurred with. In decreasing order of frequency, the immediate collocates (just before or just after the target word) looked like this:

[of, and, to] OUR [troops, military, strategy, allies]
[and, that, as, if] WE [are, have, will, would, know]
[but, so, and, that] THEY [are, failed, can, have, know, need]
[lose, rebuild, defend] THEIR [own, country, lives, new]
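
For anyone who wants to try this at home, here is a minimal sketch of the counting step in Python. It is not the tool I used; the tokenizer is deliberately crude, the function names are made up for illustration, and the sample text is a toy stand-in for a real speech.

```python
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase, keep runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def immediate_collocates(tokens, target):
    """Count the words directly before and directly after each occurrence of target.

    Note that this ignores sentence boundaries, so the last word of one
    sentence counts as preceding the first word of the next.
    """
    before, after = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        if i > 0:
            before[tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            after[tokens[i + 1]] += 1
    return before, after


# Toy text, not taken from any actual speech.
sample = "We support our troops and our allies. Our troops defend their country."
tokens = tokenize(sample)

freq = Counter(tokens)                      # word -> number of occurrences
before, after = immediate_collocates(tokens, "our")

print(freq.most_common(3))
print("before OUR:", before.most_common())
print("after OUR:", after.most_common())
```

Sorting each counter with `most_common()` gives the "decreasing order of frequency" lists shown above.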

This sort of analysis allows you to create your own speech by generating random selections according to the collocations (re-calculating at each content word), e.g. "Our troops are involved in the training to serve their leaders and 17 nations are German in Iraq." Of course this depends on the corpus -- if you select only one speech, yours is likely to resemble that one quite a lot.
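
One simple way to do this kind of generation is a bigram chain: at each step, pick the next word at random from the words that followed the current word in the source text. The sketch below is a simplification rather than exactly what I did (it picks a new follower at every word, not only at content words), and the function names and toy text are invented for the example.

```python
import random
import re
from collections import defaultdict


def build_bigram_table(tokens):
    """Map each word to the list of words that followed it (repeats kept,
    so frequent followers are proportionally more likely to be chosen)."""
    table = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        table[current].append(following)
    return table


def generate(table, start, max_words=15, seed=None):
    """Random walk through the bigram table, starting from `start`."""
    rng = random.Random(seed)
    words = [start]
    while len(words) < max_words:
        followers = table.get(words[-1])
        if not followers:          # dead end: this word was never followed by anything
            break
        words.append(rng.choice(followers))
    return " ".join(words)


# Toy text, not taken from any actual speech.
sample = "our troops defend our allies and our allies support our troops in the field"
tokens = re.findall(r"[a-z']+", sample.lower())
table = build_bigram_table(tokens)

print(generate(table, "our", max_words=10, seed=1))
```

Every adjacent pair of words in the output occurred somewhere in the source, which is why a one-speech corpus produces something that sounds so much like the original.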

Next I looked at the most frequently occurring content words. Not much of a surprise that the leaders were Iraqi (64), Iraq (58), Iraqis (48), terrorists (46), freedom (40), forces (38), war (34), fight (30), military, security, troops (all 28). Combining the various forms of Iraq* gave 180 occurrences (thus falling just between "of" and "in"). Collocates look quite interesting too:

[the, of, new, train] IRAQI [security, forces, people, government, units]
[in] IRAQ [is] ("in Iraq" occurred 28 times; "Iraq is" occurred 18 times)
[help, the, as, helping] IRAQIS [build, to, will]
[of, our] FREEDOM [in, of]
[The] TERRORISTS [and insurgents, who]
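
Merging the Iraq* forms and seeing where the combined count would rank is easy to do from a frequency table. Here is a rough sketch; the helper names are hypothetical and the counts are toy numbers, not the figures from the speech. A plain prefix match like this would also sweep up any unrelated word that happens to start the same way.

```python
from collections import Counter


def combined_count(freq, prefix):
    """Sum the counts of every word that starts with `prefix`."""
    return sum(n for word, n in freq.items() if word.startswith(prefix))


def rank_if_merged(freq, prefix):
    """1-based rank the merged group would take among the remaining words."""
    total = combined_count(freq, prefix)
    others = [n for word, n in freq.items() if not word.startswith(prefix)]
    return 1 + sum(1 for n in others if n > total)


# Toy counts for illustration only.
toy = Counter({"the": 20, "of": 9, "in": 6, "iraqi": 3, "iraq": 3, "iraqis": 2, "war": 2})

print(combined_count(toy, "iraq"))   # 8
print(rank_if_merged(toy, "iraq"))   # 3, i.e. between "of" and "in" in this toy table
```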

It's interesting to contrast this with Tony Blair's recent speech to the European Parliament. Of course this was a speech with a very different purpose, so we wouldn't expect it to go IRAQ, TERROR, IRAQIS, FREEDOM, IRAQI, IRAQI, WAR, FIGHT, FREEDOM.... His most frequent words again include a lot of closed-class words, plus "Europe" (the [396 occurrences], of, to, in, and, it, a, is, Europe (116), that, we, be, I). A bit more "I" than George, and the content words are quite different (Europe, people (44), European (36), debate (28), political (28), social (26), world (26)). Iraq and its variants didn't get a mention, and "terrorists" appeared only twice. Here are some of Tony's preferred collocations:

I [have, want, believe, would]
[if, that] WE [have, are, should, can, need]
[the, modern] EUROPEAN [Union, defence, nations, Parliament]

And here's a Tony sentence generated in the same way: "I have to accept a Europe and to be active player in foreign policy."

I would play with this more, but now it's time to work with the tools instead.