Intro to Corpus Linguistics

· @tasali · ELIN479, Week 2-1

We studied the unit "What Is Corpus and What Can It Tell Us?"

I think corpus linguistics is an interesting branch of linguistics, and it is close to my earlier interests, spaCy for instance.

Language is infinite. For that reason, creating a corpus that includes every detail of a language is nearly impossible.

Below are the properties of a corpus.

  • Electronically stored.
  • Searchable.
  • Decoded and ready to use.
  • Should represent the language in use.
  • Written or spoken.
  • Should be of a specific size.
  • Longer than a single sentence, and in a format that contains paragraphs that are related to each other.

Types are the distinct words in a text: a word that recurs is counted only once.

Tokens are the total number of words in a text.

They were ready to all but to fight, and were they smart enough to see the consequences.

The above sentence has 17 tokens in it. The analysis of those tokens can be seen below.

Pronouns: they x 2
Verbs: were x 2, see, fight
Nouns: consequences
Adjectives: ready, smart
Adverbs: enough
Determiners: the, all
Prepositions: to x 3
Conjunctions: but, and

Counting each distinct word only once, there are 13 different types in the text.
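The counts above can be checked with a few lines of Python. This is only a minimal sketch: the `tokenize` helper is my own assumption (lowercase everything, keep runs of letters and apostrophes), not something from the unit, and it treats "They" and "they" as the same type.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

sentence = ("They were ready to all but to fight, "
            "and were they smart enough to see the consequences.")

tokens = tokenize(sentence)  # every running word, repeats included
types = set(tokens)          # each distinct word form, counted once

print(len(tokens))  # 17
print(len(types))   # 13
```

Dividing the number of types by the number of tokens gives the type-token ratio, which is 13/17 for this sentence.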

Note that the same word used as different parts of speech (e.g., adverb and adjective) counts as separate types. For instance, in the sentence below, the word “enough” counts as 2 different types because the first occurrence is an adjective and the second is an adverb.

With enough sugar, you can make a cake that is big enough to feed 10 persons.
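To see how the part of speech changes the type count, here is a small sketch for the sentence above. The tags are assigned by hand for illustration and follow the labels used in these notes; they are my assumption, not the output of a tagger such as spaCy.

```python
# Hand-tagged tokens of the "enough" sentence (tags are illustrative assumptions).
tagged = [
    ("with", "PREP"), ("enough", "ADJ"), ("sugar", "NOUN"), ("you", "PRON"),
    ("can", "VERB"), ("make", "VERB"), ("a", "DET"), ("cake", "NOUN"),
    ("that", "PRON"), ("is", "VERB"), ("big", "ADJ"), ("enough", "ADV"),
    ("to", "PREP"), ("feed", "VERB"), ("10", "NUM"), ("persons", "NOUN"),
]

word_types = {word for word, _ in tagged}  # types by word form only
pos_types = set(tagged)                    # types by (word form, part of speech)

print(len(tagged))      # 16 tokens
print(len(word_types))  # 15: "enough" is counted once
print(len(pos_types))   # 16: "enough" is counted as an adjective and as an adverb
```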

Written corpora are more widely used than spoken ones; the ratio is roughly 9 to 1. That is because written text is easier to obtain and spoken language is harder to process.

There are different types of corpora. If a corpus is concerned with a single source, it is mono-modal. If it draws on multiple sources, for instance written text together with body language and physical expressions, which may be gathered using video, it is multi-modal. And lastly, there is the monitor corpus, for which the details have been given at

There are two different approaches to examining a language and creating corpora. In the descriptive approach, we take the data as it is (including errors). In the prescriptive approach, there is an expectation of how the language should be.

With corpus linguistics, one can understand how a language is used in a specific time period and place, and with what frequency.

Corpus linguistics can also tell us which words are most often used with a specific word, called collocation, and which syntactic patterns are used with it, called colligation.
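A very rough way to look for collocates is to count the words that appear within a few positions of a chosen word. The sketch below assumes a hypothetical helper `collocates` and a made-up toy text; real corpus tools use much larger data and statistical measures rather than raw counts, and this version ignores sentence boundaries.

```python
import re
from collections import Counter

def collocates(text, node, window=2):
    """Count the words found within `window` positions of each occurrence of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts

# Made-up toy text, only for illustration.
toy_text = ("She made strong tea. He likes strong tea and strong coffee. "
            "The wind was strong.")

counts = collocates(toy_text, "strong", window=1)
print(counts.most_common(1))  # [('tea', 2)]
```

In this toy text, "tea" is the most frequent neighbour of "strong", which is the kind of pattern a collocation search is meant to surface.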

We will study lexicogrammar in the upcoming classes.

We should study the definition of descriptive grammar for next week.