Corpus-building for the inquisitive teacher-researcher 

by Martha Partridge

Introduction 

Corpus Linguistics (CL) appears to intrigue people. Associations of forensic analysis and archaeological investigation seem to abound: examining collocates; digging for words; mining language. CL studies often refer to indecipherable statistical measures and linguistic abbreviations, too: n-grams, log-likelihood, t-score, chi-squared test, lempos, KWIC – the list goes on. CL could therefore seem a little intimidating, and perhaps curious teacher-researchers are deterred from trying some corpus-based experiments themselves. Approached systematically and with curiosity, however, CL offers an accessible and adaptable method of language study.
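In fact, most of these terms are less arcane than they sound. A KWIC ('Key Word In Context') display, for instance, simply lines up every occurrence of a search term with a little context on either side – something a few lines of Python can sketch (the sample sentence and window size below are purely illustrative):

```python
# Minimal KWIC (Key Word In Context) sketch: show each occurrence of a
# search term with a fixed window of context on either side.

def kwic(text, keyword, window=3):
    """Return (left context, keyword, right context) tuples."""
    tokens = text.lower().split()
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows

sample = "the teacher gave feedback on the essay and the feedback was clear"
for left, kw, right in kwic(sample, "feedback"):
    print(f"{left:>30} | {kw} | {right}")
```

Concordancing software does exactly this (plus sorting, filtering and statistics) at scale.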

To offer a definition, CL is the analysis of “a collection of naturally occurring examples of language … collected for linguistic study” (Hunston, 2002, p. 2). This collection is known as a ‘corpus’, or ‘body’ of text, and has been a feature of linguistic research for longer than most would imagine. For instance, Käding (1897) collected and analysed – manually – a German corpus of 11 million words. Unsurprisingly, when computers became available they were quickly adopted as a necessary tool for working with corpora, giving rise in the late 1950s to CL as it is now recognised (McEnery & Hardie, 2013). Not only does this digitisation allow for the use of increasingly sophisticated analytical tools, but it also means it is possible to examine far larger bodies of language than ever before (Hoffman et al., 2008); enTenTen20, for example, is a web-based English language corpus of 36 billion words (Sketch Engine, n.d.).

The value of CL therefore lies in numbers; it is a data-driven approach to linguistic analysis, which helps dampen the effect of researcher bias as well as boost the reliability, and therefore meaning, of results (Baker, 2006). Whether these results are useful depends largely on the quality of the corpus from which the data is extracted (Reppen, 2010). To study general language use, corpora such as enTenTen20 or the British National Corpus are large enough to be accurately representative of an entire language, and are used as such (Hoffman et al., 2008). However, if a particular type or aspect of language interests you, you will likely need to create your own corpus. I find that this is generally a monotonous yet untaxing procedure, though one of great importance; your corpus forms the foundation of all the subsequent analysis you will do.

With this in mind, I have reflected on my own (non-expert) experience of corpus-building. This includes two assignments and a dissertation for an MA in Applied Linguistics, and my current project on feedback comments among EAP teachers at CALD: four corpora, ranging from 2,110 to 2,347,926 words. The first three times, I planned and failed to take ‘notes for next time’ to remind myself of mistakes to avoid and successes to repeat. Resisting the lure of procrastination this time, I have noted them below, having just completed my fourth corpus. I hope they may be of interest, and perhaps save you time and frustration in your own corpus-building experiments.   

  1. Have clear aims 

Understanding why you are building your corpus is an essential starting block (Reppen, 2010). This may sound obvious, but I have found that flaws in my research questions have come to me after I have begun building my corpus, meaning I have had to start again. This could pertain to the search terms used to find relevant texts or the time span investigated, for example. Recently, I adapted my corpus of teacher feedback comments to include two sub-corpora, allowing for comparison of feedback at early and late stages of the IFP course. If I had been clearer on my research aims from the beginning, I would have saved myself time.  

I think two things can help you be clear on why you are building your corpus: reading, and talking. There are several accessible guides to CL research, with sections detailing the building of corpora and the relevant considerations. I found McEnery and Hardie (2012) and Baker (2006; 2010) particularly useful. It also helps to see the process of corpus-building in action; the closest thing to this is the methodology sections of published CL-based studies. Interesting sociolinguistic works include Jaworska and Krishnamurthy (2012), Baker (2014) and Karimullah (2020) – each explains the rationale behind how its corpus was built.

Finally, discussing your ideas with others will help you notice holes in your research aims and corpus design. If you find you cannot explain why you have made a certain decision regarding the latter, or lack clarity on the former, then these details need to be ironed out before you start corpus-building. As well as colleagues, the Facebook group ‘Corpus Linguistics’ has over 10,000 members and has proved a useful source of ideas and advice for me.   

  2. Note ideas while you build

Building your corpus normally involves copying and pasting from your source into a plain-text document. You generally do not read everything, as this would be too time-consuming. After all, a common reason for using CL in the first place is to analyse a quantity of text that we do not have time to read. However, you will inadvertently end up skimming your data, and as you do, will become familiarised with it (Baker, 2006). This is a key advantage of building a corpus yourself, as you will begin to pick up on patterns which may be worth further investigation. For example, while building a corpus of newspaper articles about feminism for my dissertation, I noticed how often the names of female celebrities appeared. These may not have been picked up as high-frequency words by the software I used later in the process. However, following up on this observation resulted in a useful case study on the representation of Emma Watson, which complemented the rest of my research.

Ideas for avenues of inquiry such as this case study will come to you while you (fairly absent-mindedly) build your corpus. Write them down when they float into your head – by the time you begin all the exciting work on collocations and frequency lists, realising with some surprise that you have found something interesting, you will likely have forgotten them. They are worth remembering though, as they are often the slightly more creative, nuanced branches of a CL-based study which add depth, richness or another perspective to its methodical trunk.
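Frequency lists themselves are conceptually simple: a concordancer just counts tokens. As a rough illustration of what such a tool does under the hood (the sample feedback comment below is invented):

```python
# Sketch: a raw frequency list from corpus text, of the kind most
# concordancing tools produce as a starting point for analysis.
from collections import Counter
import re

def frequency_list(text, top_n=10):
    """Count word tokens, ignoring case and punctuation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_n)

sample = "Feedback on the thesis statement; the thesis needs a clearer claim."
print(frequency_list(sample, top_n=5))
```

Real tools add lemmatisation, part-of-speech tagging and statistical measures on top, but the counting principle is the same.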

  3. Build within your limits

There is no simple rule dictating the ideal size of a corpus (Reppen, 2010). This depends largely on three factors: the type of language analysed; the aspect of language analysed; and the resources available (Baker, 2010). For instance, my smallest corpus of 2,110 words was created to assess the linguistic difficulty of two documents given to foreign national prisoners informing them of their deportation. Therefore, these documents alone were completely representative of the language I was interested in. Most of the time however, it is not viable to achieve total representation, in which case the aim is to collect enough examples to accurately represent the language investigated (Reppen, 2010). But what is ‘enough’? There are statistical approaches to estimating this (Caruso, 2014), but unless you are conducting empirical research, my advice is to use published CL studies as guides. The largest corpus I have built – 2,347,926 words – was used to explore the representation of feminism/ists in British newspapers. One key aim was to compare findings with a previous study, which used a corpus of 2,388,004 words (Jaworska & Krishnamurthy, 2012). This was reason enough to decide the size of my corpus. 

However, the final limiting factor on corpus size is an important one: the resources of time and money available to the researcher (Baker, 2010). A heavy workload or lack of funding limits the extent to which efforts can be dedicated to corpus-building. The mental toll of such a screen-based, repetitive process should also not be ignored (Reppen, 2010). So you need to build within your limits. What is more, many argue that meaningful results can still be taken from a fairly small corpus, particularly for pedagogic purposes (see Reppen, 2010, pp. 54-55 for a brief discussion). Besides, you can always treat your initial investigations as exploratory, identifying avenues of further analysis for the future.

Personally, I err on the side of growing my corpus to as high a word count as is viable. In my experience, it is tempting to cut short this process to skip to the interesting stage of extracting collocations and word lists. On my current project, I reached 10,000 words and arbitrarily decided that was sufficient. After some initial analysis, however, I realised that a) there were some potentially interesting patterns, but not enough instances to be statistically worthy of analysis; b) I could make the corpus fully representational by including all relevant feedback comments if I pushed on to around 20,000 words. So, it is worth spending time growing your corpus to as large as you can manage, remaining mindful of your limits.
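If your corpus lives in a folder of plain-text files, it takes only a few lines to keep an eye on the running word count as you grow it. This sketch assumes a flat folder of .txt files, which may not match your own setup:

```python
# Sketch: total word count across all plain-text files in a corpus folder,
# useful for monitoring corpus size against a target.
from pathlib import Path

def corpus_word_count(folder):
    """Sum whitespace-separated tokens across every .txt file in folder."""
    total = 0
    for path in Path(folder).glob("*.txt"):
        total += len(path.read_text(encoding="utf-8").split())
    return total

# e.g. print(corpus_word_count("feedback_corpus"))
```

Note this is a crude token count; your analysis software may count words slightly differently (hyphenation, apostrophes, numbers), so treat it as a guide rather than an exact figure.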

  4. Be pedantic

In a similar vein, corpus-building lends itself to the detail-oriented, pedantic individual. The type who notices incorrect apostrophes on signage and is partial to colour-coding Excel spreadsheets. I am one of these people sometimes, but not always. When building a corpus though, I try to channel the pedant in me, because quality data means quality results. Indeed, Kennedy (1998, p. 68) asserts that researchers should “bear in mind that the quality of the data they work with is at least as important [as the size]”. The software or application you will later use to extract terms, be it Sketch Engine, AntConc or #LancsBox (see Berberich and Kleiber, 2020, for a comprehensive list), will not recognise a misspelled or abbreviated word. This can affect your results significantly. I am currently analysing teacher feedback comments; if I had relied on myself and my colleagues accurately typing ‘thesis’, ‘paragraph’ or ‘similarity’ – not ‘thessis’, ‘pragraph’ or ‘similiraity’ – 100% of the time, my results would look slightly different.

Cleaning your corpus does not need to take up much time – I copied my corpus into a Word document and checked the spelling suggestions in the Editor tool. This process picked up plenty of typos, abbreviations, student names and html tags I had accidentally left in. This was all useful – I could then decide whether ‘e.g.’ should remain as an abbreviation, or whether to change all instances to ‘for example’, so that I would get a reliable representation of how often teachers include examples in their feedback. The use of student names also indicated a potentially interesting variation in personalisation among comments – could this form an avenue of investigation with which to triangulate my initial findings? I noted this down of course, following my own advice in section 2 above.
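For those who prefer scripting the cleaning pass, the same checks can be sketched in a few lines of Python: stripping stray HTML tags, settling on one convention for ‘e.g.’, and flagging words that look like near-misses of terms you expect to find. The expected word list here is illustrative, not a real glossary:

```python
# Sketch of a corpus-cleaning pass: remove leftover HTML tags, expand one
# abbreviation consistently, and flag likely typos against expected terms.
import difflib
import re

EXPECTED = {"thesis", "paragraph", "similarity"}  # illustrative glossary

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)              # drop leftover HTML tags
    text = re.sub(r"\be\.g\.", "for example", text)   # one convention throughout
    return re.sub(r"\s+", " ", text).strip()

def likely_typos(text):
    """Flag words that closely resemble, but do not match, expected terms."""
    flags = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word not in EXPECTED:
            close = difflib.get_close_matches(word, EXPECTED, n=1, cutoff=0.8)
            if close:
                flags.append((word, close[0]))
    return flags

raw = "<p>Revise the thessis, e.g. the opening claim.</p>"
print(clean(raw))
print(likely_typos(raw))
```

The flagging step deliberately reports rather than auto-corrects – a human still decides whether ‘thessis’ is a typo or a genuine token worth keeping.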

Take the time to polish your corpus and make decisions about abbreviations and spelling conventions (American or British, for example?). Drury (2022) suggests creating a glossary, particularly if multiple people are building the corpus, to ensure consistency – ‘analyse’ or ‘analyze’; ‘e.g.’ or ‘for example’. Drury (ibid.) also recommends Notepad++ (Ho, 2022) for editing multiple files within your corpus simultaneously, though I have not used it myself. However you do it, attending to details will mean you have a robust, reliable corpus to base your analysis on later.

  5. Accept the iterations

Though you have just read my warnings and suggestions in this post, it is likely you will forget most of them and make very similar errors while building your own corpus. That is fine, because it probably means you are engaging with your data, questioning initial results and identifying improvements to your method. So, accept that the process is an ‘iterative’ one – a term which makes me smile a little when I read it in journal articles because it suggests the researcher was working out what they were doing as they went along, realising they were not quite right at first and re-visiting things to iron out the kinks. I find this reassuring, and it is a reminder of the necessarily cyclical nature of any research project. So, welcome these iterations as part of your corpus-building adventure and not a frustrating waste of time. Turn them into a blog post to share with your colleagues.  

References 

Baker, P., 2006. Using Corpora in Discourse Analysis. London: Continuum. 

Baker, P., 2010. Sociolinguistics and corpus linguistics. Edinburgh University Press. 

Baker, P., 2014. ‘Bad wigs and screaming mimis’: Using corpus-assisted techniques to carry out critical discourse analysis of the representation of trans people in the British press. In: Contemporary Critical Discourse Studies, pp. 211-235.

Berberich, K. & Kleiber, I., 2020. Tools for Corpus Linguistics. Available at: https://corpus-analysis.com/ [Accessed 21 June 2022]. 

Drury, A., 2022. Creating a mathematics corpus and keyword list. [Unpublished PowerPoint & workshop]. University of Leeds.

Gabrielatos, C., & Baker, P., 2008. Fleeing, sneaking, flooding: A corpus analysis of discursive constructions of refugees and asylum seekers in the UK press, 1996-2005. Journal of English Linguistics, 36(1), 5-38. 

Ho, D., 2022. Notepad ++. V8.4.1. [Software]. [Accessed 21 June 2022]. 

Hoffman, S., Evert, S., Smith, N., Lee, D., Prytz, Y. B., 2008. Corpus Linguistics with BNCweb – a Practical Guide. Peter Lang. 

Hunston, S., 2002. Corpora in Applied Linguistics. Cambridge University Press. 

Jaworska, S. & Krishnamurthy, R., 2012. On the F word: A corpus-based analysis of the media representation of feminism in British and German press discourse, 1990–2009. Discourse & Society 23, (4) 401–431.    

Käding, J., 1897. Häufigkeitswörterbuch der deutschen Sprache. Steglitz bei Berlin: Selbstverlag.

Karimullah, K., 2020. Sketching women: A corpus-based approach to representations of women’s agency in political internet corpora in Arabic and English. Corpora, 15(1), 21-53. 

Kennedy, G., 1998. An Introduction to Corpus Linguistics (1st ed.). Routledge. 

McEnery, T. & Hardie, A., 2012. Corpus Linguistics. Method, Theory and Practice. Cambridge University Press. 

McEnery, T. & Hardie, A., 2013. The history of corpus linguistics. In: K. Allan, ed. 2013. The Oxford Handbook of the History of Linguistics, pp. 727-745.

Reppen, R., 2010. Building a corpus. In: A. O’Keeffe & M. McCarthy, eds. 2010. The Routledge Handbook of Corpus Linguistics, pp. 31-37.

Sketch Engine, (n.d.). enTenTen: Corpus of the English Web. Available at: https://www.sketchengine.eu/ententen-english-corpus/ [Accessed 21 June 2022]. 

7 thoughts on “Corpus-building for the inquisitive teacher-researcher”

  1. So interesting! This is a really great contribution.

    I’m very interested in how you have explored CL and unpacked your valuable experience. I wonder whether there is an opportunity to build authentic functional discourse resources from CL that might support learners in formulating their critical ideas?

    Inspiring work Martha.

    – Maggie

    1. Thank you Maggie! Your idea for using CL to build functional discourse resources sounds interesting – we will have to catch up about that.

      I look forward to reading a post from you soon!

      Martha

  2. What a great blog post, Martha! To be honest, I’ve never really been that drawn to CL, but you’ve persuaded me to consider giving it a go (I’m one of those annoying people who spot the incorrect apostrophes on signage). I’d be very interested to hear a little more about your adventures and findings!

    1. Thanks for commenting Cathy! It sounds like you would suit CL if you have an eye for detail! The sociolinguistic studies I mention in the post are motivating in terms of trying out CL yourself, as the findings are so interesting 🙂

      1. Yes, that type of study does sound motivating. It would probably be very good research for creative writing too, and might help in the quest to create an authentic sounding voice. As I think you know, I write historical fiction, and am now wondering how I could use CL to help me in the research process. Any ideas very gratefully received!
