Exploring language

Thanks to Steve Novella of the Neurologica blog, I have discovered a new toy to play with: the Google Ngram Viewer. I’d like to share it with you, and encourage you to play as well.

You may have heard of all the books that Google has been digitizing for their Google Books project. (It caused some stir among publishers and writers – it seems to have been sorted out now.) Well, the Ngram Viewer is a very Google-esque* way of looking at word count statistics from that huge collection of books.

Let’s say you’re curious about the relative popularity of two words – say, “humanist” and “atheist”. Well, you enter them as search terms, and voilà:

Relative frequencies of “humanist” and “atheist” in the Google Books corpus, from 1800 to 2000.

We can watch the relative frequencies of these words over time. Unexpectedly (to me at least), we see “humanist” (blue) overtake “atheist” (red) during the first half of the twentieth century, after the two track together for a couple of decades (the 1920s and 1930s). I’ll leave it up to readers to try to infer the reason for this inversion.

The term “n-gram” (yes, pronounced the same as “engram”, but there’s no connection to neuropsychology or Scientology) is used in corpus-based linguistics to denote sequences of words. A unigram is a sequence of 1 word; in the graph above, we compare the frequencies of two unigrams (relative to the total number of unigrams in the corpus). A bigram is a sequence of 2 words. Trigram: 3 words. From there on, it is common just to use the number: 4-gram, 5-gram, etc.
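If it helps to see the idea in code, here is a minimal Python sketch of pulling n-grams out of a text and computing a relative frequency. It assumes the crudest possible tokenization (lowercase, split on spaces), and the function names are my own invention; it is not a description of how Google actually processes its corpus.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, each as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def relative_frequency(tokens, phrase):
    """Count of `phrase` divided by the number of n-grams of the same length."""
    target = tuple(phrase.lower().split())
    grams = ngrams(tokens, len(target))
    if not grams:
        return 0.0
    return Counter(grams)[target] / len(grams)

# A made-up sentence standing in for a corpus.
text = "the national debt grew and the national debt shrank"
tokens = text.lower().split()

print(ngrams(tokens, 2)[:3])                        # the first three bigrams
print(relative_frequency(tokens, "the"))            # a unigram frequency
print(relative_frequency(tokens, "national debt"))  # a bigram frequency
```

Note that a bigram is measured against all the bigrams in the text, not against all the words, which is the same "relative to the total" idea as in the unigram graph above.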

One more unigram comparison that I thought was interesting: function words. Check out this graph comparing “the”, “and”, “of”, “for”, “a”:

Unigram frequencies of “the”, “of”, “and”, “a”, “for”, from 1800 to 2000.

What is interesting here is that the frequencies show very little change over two centuries, both relative to one another and in absolute terms. Think about all of the change in the language that those two centuries represent – from shortly after the founding of America to around the time of the latest millennial fever. And these five words have shown such amazing constancy. Sure, there is some change, but compare it to the changes in the other graphs, and the difference is clear.

So, let’s check out a bigram comparison. Here’s a chart of “national debt” and “social security”:

Bigram frequencies for “national debt” and “social security” from 1800 to 2000.

I’m no political scientist, but it looks like interest in social security leaped onto the scene in the late 1930s and has been slowly climbing ever since, while talk about national debt (in the English-speaking world) has declined steadily since the earliest samples in this corpus.

I could go on all day about this, but I’d rather leave it to you now. Before you take off to do your own informal surveys of this delicious data repository, let me offer a couple of caveats.

First, the numbers are only as reliable as the sources. What are the sources? Google gives some information on this. They note some sources of error; they also acknowledge some inherent biases. For example, there are more computer books in recent years than in the 1800s. Whether this is a problem or not depends on the sort of question you’re asking, and how you are interpreting it.

Second, there are different numbers of books in different time periods. The data actually go back as far as 1500, but you run into problems when, say, a particular year is represented by only one book. (Check out the results for that nice constant graph of function words, if you go back to 1500.)
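To see why thin years cause trouble, here is a small Python sketch. The numbers are toy values I invented purely for illustration (they are not real Ngram data), and the `min_books` cutoff is my own hypothetical filter, not something the Ngram Viewer offers.

```python
# Toy numbers, invented for illustration only (not real Ngram data).
# Each entry: year -> (occurrences of the word, total words, number of books)
yearly_counts = {
    1510: (0, 20_000, 1),
    1511: (900, 15_000, 1),
    1900: (61_000, 1_000_000, 400),
    1901: (60_000, 1_000_000, 410),
}

def relative_frequencies(counts, min_books=1):
    """Per-year relative frequency, skipping years with fewer than min_books books."""
    return {
        year: hits / total
        for year, (hits, total, books) in counts.items()
        if books >= min_books and total > 0
    }

all_years = relative_frequencies(yearly_counts)
well_sampled = relative_frequencies(yearly_counts, min_books=100)

print(all_years)      # the one-book years swing from nothing to a lot
print(well_sampled)   # only the years with many books remain
```

With a single book in a year, whatever that one author happened to write is the whole sample, so the line on the graph jumps around. With hundreds of books, individual quirks average out.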

Third, always keep in mind what it is that you’re measuring. These graphs do not measure belief (search “bigfoot, ufo, unicorn”). They do not measure popularity or approval (search “murder, charity”, or the “national debt, social security” illustration above). They simply measure how often people mention the words (or bigrams, trigrams, etc.) in published books. (Periodicals are excluded.)

Having said that, it is still a delightful way to while away a day. If you’re stuck for ideas, here are a couple of classic sources of interesting patterns:**

  • What are the relative frequencies of different number words? Is there anything systematic here? Any surprises?
  • What are the relative frequencies of gender-marked pronouns (“he, she”, for example)? How about gender-marked nouns (“man, woman”)?***

Have fun, my merry scientists!


* Google-esque: powerful, easy to use, with the potential to distract me from real work with its endless possibilities to explore.

** Before doing any search, see if you can guess what the results will be. Form a hypothesis, give a reason for your expectation. If the results agree with your expectation, congratulations! If not, see if you can explain why. Does this new explanation generate predictions about some other word frequency pattern that you could now test?

*** There is at least one pair of gender-marked nouns that seems to reverse the general trend. Can you find them? Why would they be different?


