“Corpus (corpora in plural form) means body in Latin. Corpus is also a branch of linguistics where it proposes that reliable language analysis is more feasible with corpora (samples) collected in the field in its natural context (realia), and with minimal experimental-interference” (“Corpus linguistics,” n.d., para. 1) 1.
“A corpus is simply an electronically stored, searchable collection of texts. These texts may be written or spoken and may vary in length but generally they will be longer than a single speaking turn or single written sentence. They are normally measured in terms of the number of words they contain or to use a word common in most corpora, the number of tokens” (Jones & Waller, 2015, p. 5) 2.
Consider this example and examine the tokens below.
They were ready to all but to fight, and were they smart enough to see the consequences.
The above sentence has 17 tokens. A partial analysis of those tokens by word class is shown below.
|Word class|Tokens|
|---|---|
|Pronouns|they x 2|
|Verbs|were x 2, see, fight|
|Prepositions|to x 3|
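Counting each distinct word form only once, there are 13 different types in the text: the 17 tokens include four repeat occurrences (they and were each appear twice, and to appears three times).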
“Types and tokens can also be compared by dividing the number of types by the number of tokens, giving us a type:token ratio” (Jones & Waller, 2015, p. 6) 2.
Given that the above example has 17 tokens and 13 types, dividing 13 by 17 and multiplying by 100 gives a type:token ratio of about 76%.
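The count above can be reproduced with a short script. This is only a minimal sketch: the function names `tokenize` and `type_token_ratio` are illustrative, and the tokenizer assumes that a token is any lowercase run of letters or apostrophes, which is far simpler than what real corpus software does.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a token is any run of letters or apostrophes, so
    punctuation is dropped and 'They' and 'they' count as one type.
    """
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Return (number of tokens, number of types, ratio as a percentage)."""
    tokens = tokenize(text)
    types = set(tokens)  # each distinct word form is counted once
    return len(tokens), len(types), 100 * len(types) / len(tokens)


sentence = ("They were ready to all but to fight, "
            "and were they smart enough to see the consequences.")

n_tokens, n_types, ttr = type_token_ratio(sentence)

# Word forms that occur more than once, in order of first appearance.
counts = Counter(tokenize(sentence))
repeated = {word: n for word, n in counts.items() if n > 1}

print(n_tokens, n_types, round(ttr))  # 17 13 76
print(repeated)  # {'they': 2, 'were': 2, 'to': 3}
```

Lowercasing before counting is a design choice: without it, "They" and "they" would be treated as two types and the ratio would rise.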
“Corpora can be mono-modal (through one medium, typically text) or multi-modal (through more than one medium, typically text and video). Due to costs, most corpora are mono-modal, although increasingly multi-modal corpora are being developed. According to Sinclair (1991), a corpus should consist of a principled collection of texts. This means a corpus should contain texts that can provide answers to questions we want answers to” (Jones & Waller, 2015, pp. 6-7) 2.
“By way of example, if we wished to analyze the performance of learners in a set of English language tests, we would need samples of their written and spoken work from the tests to be able to make realistic statements about the language in use. We would also need to make decisions about whether to include students who pass or fail tests with a particular mark. Other variables we would need to acknowledge and control for are the age and nationalities of the candidates. If the test is taken by a range of nationalities, for example, we would need a sample of tests that give a representative sample of those nationalities. We would also need to make a decision about how many words (or tokens) to include. This should be based upon two aspects: what we intend to use the corpus for and, practically, how many texts we can collect in the time available to us” (Jones & Waller, 2015, p. 7) 2.
“In the hypothetical example of the corpus of tests, should we wish to make statements about how a grammatical pattern is used across different levels, then clearly we would need a lot more words than if we wished to investigate how a particular pattern was used only in a written test at one particular level. Finally, we would need to decide upon the type of corpus we need. For example, a mono-modal corpus of texts would give us information about candidates’ writing and speech but in the case of speech, we would be unable to comment upon their use of body language and how this acts to reinforce their message” (Jones & Waller, 2015, p. 7) 2.
“The reason we wish to use a corpus is that we can analyze large quantities of language and uncover patterns of usage which our intuitive sense about language may miss. This then allows researchers to make clearer and better descriptions of language which can inform practice or simply develop our understanding of language in use. When looking at a corpus to make statements about grammar, we are taking a descriptive, as opposed to prescriptive, stance. In other words, we are seeking to show how the language is used and from this make statements about it, such as the rules that are followed. The opposite approach is to formulate a rule, often based on intuition, set this as a ‘standard’ and then attempt to suggest that deviations from this are not correct. This is a prescriptive stance (see Freeborn, 1995 for a useful discussion of this area)” (Jones & Waller, 2015, pp. 7-8) 2.
“The first thing we can use a corpus for is to test and challenge our intuitions about language. A corpus may underline or refute an idea we have about language use. To give an example, we may wish to uncover how speakers report what others are saying in conversation. We may assume that speakers always use ‘back-shift’ to report speech and employ a ‘rule’ often given in grammars of English (for example, Murphy, 2012), where the common ‘formula’ is ‘He/she said that + back-shift’ e.g. ‘He said that he was going’ to report ‘I’m going’. However, when this aspect of language has been investigated using corpora, it has been found that this does not always follow. McCarthy and Carter (1995), for instance, looked at the five million-word CANCODE corpus of spoken English and found that ‘X was saying + summary with or without back-shift’ e.g. ‘She was saying she’s starting a new job’ was also a very frequent way of reporting what others have said in conversations. Carter and McCarthy (2006) describe reporting using the Cambridge English Corpus and suggest that while back-shift is common in conversations, it is also the case that reported and direct speech are often mixed together to make more vivid stories” (Jones & Waller, 2015, p. 8) 2.
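To make the idea of testing an intuition against a corpus concrete, the sketch below counts the two reporting frames mentioned in the quotation in a tiny hypothetical corpus. Only the first two utterances come from the quoted examples; the other two are invented for illustration, and the regular expressions are simplified assumptions rather than the method used in the cited studies. Frequencies from four sentences say nothing about real usage; a real study would run comparable searches over millions of tokens.

```python
import re

# Hypothetical mini-corpus: the first two utterances are the examples quoted
# above; the last two are invented purely for illustration.
utterances = [
    "He said that he was going",
    "She was saying she's starting a new job",
    "They said that they were tired",
    "He was saying he might be late",
]

# Assumed, deliberately simplified patterns for the two reporting frames.
patterns = {
    "said that": re.compile(
        r"\b(?:he|she|they)\s+said\s+that\b", re.IGNORECASE),
    "was saying": re.compile(
        r"\b(?:he|she|they)\s+(?:was|were)\s+saying\b", re.IGNORECASE),
}

# Count how many utterances contain each frame at least once.
frequencies = {
    label: sum(1 for utterance in utterances if pattern.search(utterance))
    for label, pattern in patterns.items()
}

print(frequencies)  # {'said that': 2, 'was saying': 2}
```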