Using Google as your Corpus

Since the inception of computational linguistics, corpus linguistics has taken huge leaps forward. What’s corpus linguistics? It’s the process of taking massive amounts of spoken or written language, putting it all in a searchable database (something Google does very, very well), and then finding patterns in how the language is used.

What does this have to do with language learning?

Corpora (the plural of corpus) are treasure troves of information. You can see how languages change over time, figure out when words were first introduced to the language, and understand how language usage changes from one situation to another. The language that’s used in politics, for example, may be different from the language used in sports media.

The most popular corpus for American English is COCA, or the Corpus of Contemporary American English. You can actually use it for free and it’s fun to poke around.

Using the Corpus

To use any corpus, you type in the word you’re looking for. You may see the word ‘lemma’ in your search tool. A lemma is the base (dictionary) form of a word, so a search on the lemma ‘run’ also covers ‘runs’, ‘ran’, and ‘running’. ‘Run’ is a fun lemma to try in an English corpus search because it has many different meanings.

Think about the word ‘run’. We have:

• To run a race (very common).
• To run for office (common in the US right now because of the Presidential campaign).
• Running out of time (still common).
• Run up a bill (less common).
• Etc. Etc. Etc.

The corpus will give you TONS of information on how the word is used in written and spoken language.
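
For the curious, here is a minimal Python sketch of what a corpus search does with a lemma: it finds every form of the word and shows each hit with a few words of context. The tiny sample text and the hand-written list of forms of ‘run’ are made up for illustration. A real corpus such as COCA works on far more text and uses proper lemmatization, so treat this as a toy model only.

```python
import re

# Hypothetical mini-corpus, invented for this illustration.
SAMPLE_TEXT = (
    "She runs a race every spring. He ran for office last year. "
    "We are running out of time. They run up a bill at every restaurant."
)

# Hand-written forms of the lemma 'run' (an assumption, not a real lemmatizer).
RUN_FORMS = {"run", "runs", "ran", "running"}


def concordance(text, forms, width=3):
    """Return each hit with up to `width` words of context on either side."""
    words = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, word in enumerate(words):
        if word.lower() in forms:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines


hits = concordance(SAMPLE_TEXT, RUN_FORMS)
for line in hits:
    print(line)
```

Each printed line puts the matched form in brackets, which makes it easy to scan down the list and spot the different meanings.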

You’re probably thinking: Great! Another neat tool that I really don’t have time to use.

That’s OKAY!

Let’s get real: learning another language is complicated enough without throwing in a bunch of linguistic mumbo-jumbo.

The poor man’s corpus (and by poor I mean short on time) is Google.

If you’re looking for ways to use a word or phrase, or unsure of what a word or phrase means, use a Google search as your quick and easy corpus.

I did this for a class of students the other day. We were listening to an interview and the person said “You’re only as good as the artists you’ve booked.” The first issue was the word ‘booked’, but this is easily translated and we can substitute other, higher-frequency words, such as ‘reserved’ or ‘scheduled’.

The challenging part was the phrase ‘You’re only as good as…’. It is an English expression, not a very common one, but you do hear it from time to time.

The students needed other examples.

I couldn’t think of any other examples.

Enter Google – the .2 nanosecond corpus.

Type in “only as good as” and here’s what comes up:
[Screenshots: Google search results for “only as good as”]

Employing Google as your very own corpus allows you to get a quick overview of how the expression “only as good as” is used.
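
If you save a handful of the sentences you find, the same overview can be sketched in a few lines of Python. The three sentences below are the ones quoted in this post, and the function name `completions` is simply a label chosen for this example. It pulls out whatever comes after “only as good as” so you can compare the endings side by side.

```python
import re

# Example sentences quoted in this post; a real search would return many more.
SENTENCES = [
    "You're only as good as the artists you've booked.",
    "You're only as good as the company you keep.",
    "You're only as good as your last game.",
]

# Capture the words after the phrase, dropping one trailing punctuation mark.
PATTERN = re.compile(r"only as good as\s+(.+?)[.!?]?$", re.IGNORECASE)


def completions(sentences):
    """Collect whatever follows 'only as good as' in each sentence."""
    found = []
    for sentence in sentences:
        match = PATTERN.search(sentence.strip())
        if match:
            found.append(match.group(1))
    return found


endings = completions(SENTENCES)
for ending in endings:
    print("only as good as ...", ending)
```

Seeing the endings lined up this way shows students that the slot after the phrase takes a noun phrase naming whatever your reputation depends on.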

Google also gives you a count of the results. This is helpful if you’re not sure which of two versions is more common. First you have “You’re only as good as the company you keep,” which is the original, historically correct version of the quote, with 650 million hits.

Okay – this one is popular.

Next you have “You’re only as good as your last game,” which is used in sports media and, while perfectly acceptable, is less widely used, with only 100 million hits.

What you’re really looking for in the search results is a big gap: one version with millions of hits and another with only a few thousand. That is when the results give you good feedback on your language use.
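
Here is a rough Python sketch of that comparison. The function `compare_hits` and its `big_gap` threshold of 100 are hypothetical choices for this example, not anything Google provides. Keep in mind that Google’s result counts are rough estimates, so only a large gap should be read as meaningful.

```python
def compare_hits(phrase_a, hits_a, phrase_b, hits_b, big_gap=100):
    """Say which phrase has more hits and how large the gap is.

    `big_gap` is a hypothetical threshold: a ratio at or above it
    is labeled a strong signal, anything smaller a weak one.
    """
    (top, top_hits), (low, low_hits) = sorted(
        [(phrase_a, hits_a), (phrase_b, hits_b)],
        key=lambda pair: pair[1],
        reverse=True,
    )
    if top_hits == 0:
        return "Neither version has any hits."
    if low_hits == 0:
        return f'"{top}" is the only version with any hits.'
    ratio = top_hits / low_hits
    strength = "strong" if ratio >= big_gap else "weak"
    return f'"{top}" is {ratio:.1f}x more common than "{low}" ({strength} signal).'


# The two counts reported in this post.
post_result = compare_hits(
    "the company you keep", 650_000_000,
    "your last game", 100_000_000,
)
print(post_result)

# Made-up counts showing a millions-versus-thousands gap.
demo_result = compare_hits("version one", 2_000_000, "version two", 4_000)
print(demo_result)
```

With the counts from this post, the two versions come out within the same order of magnitude, so both are clearly in real use. The made-up second pair shows the kind of gap that tells you one version is the one to learn.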

If you come up with a good example, post it below!

Interested in corpora for different languages? Check these out:


  • carin chapin

    Wow. A whole new resource. Thanks!