COR: Corpus Linguistics

Francis Bond, 2011, 2012, 2014, 2018, 2020, 2024.

This course is an introduction to the fast-growing field of corpus linguistics. It aims to familiarise students with key concepts and common methods used in the construction of language corpora, as well as tools that have been developed for searching and using major corpora such as the British National Corpus, the Czech National Corpus, and CJK corpora. Students will be given hands-on experience in pre-editing, annotating, and searching corpora. Criteria and methods used for evaluating corpora and analytical tools will also be discussed. This lays the groundwork for research using big data. The main aim of this module is to master the uses of text corpora in linguistics research and data analysis.

Course Content

This course introduces basic corpus skills for linguists.

Course page (here); Source on GitHub

There is no textbook; readings will be assigned each week.

Course Outline

Lecture 1
Topic: Basic Concepts, What can we do with Corpora?
Readings/Extra Information/Tools: Corpus and Text: Basic Principles (Sinclair 2005) in Wynne (2005)

Lecture 2
Topic: Markup and Annotation
Readings/Extra Information/Tools: Adding Linguistic Annotation (Geoffrey Leech 2005) and Metadata for Corpus Work (Lou Burnard 2005) in Wynne (2005); NTU-MC Tagsets: cmn, eng, jpn, ind, universal, universal (old version)
Assessment: Lab 1
Fun: Phonetic Punctuation (Victor Borge)

Lecture 3
Topic: Representativeness and Balance; Multimodal and Multilingual Corpora
Readings/Extra Information/Tools: Koehn (2005); Martin et al. (2007); and Character Encoding in Corpus Construction (Anthony McEnery and Richard Xiao 2005) in Wynne (2005)
Discuss the results of the tasks
Assessment: Lab 1 due; email corpus choice for Lab 2

Lecture 4
Topic: A Survey of Available Corpora
Readings/Extra Information/Tools: Various Corpora; Open Korean Corpora; Balanced Corpus of Contemporary Written Japanese (BCCWJ)
Assessment: Present Lab 2

Lecture 5
Topic: DIY Corpora, Web as Corpus, Processing Raw Text, SQL
Readings/Extra Information/Tools: NLTK Chapter 1; sqlitebrowser (DB Browser); SQLite tutorial; NTU Multilingual Corpus; Databases (zipped): English, Chinese, Wordnet, all of them in one zipfile
Assessment: Lab 3
Fun: Bobby Tables

Lecture 6
Topic: Collocation, Frequency, Corpus Statistics
Readings/Extra Information/Tools: Dunning (1993); Corpus test Wizard; Social Science Statistics
Assessment: Project 1
Fun: Passives at the Language Log

Lecture 7
Topic: Encoding, Tokenization + CJK Corpora
Guest Lecture by Martina Jemelková
Assessment: Lab 3 due; Lab 4 intro

Lecture 8
Topic: Lexical and Grammatical Studies, Variation
Readings/Extra Information/Tools: Biber et al. (1998), Chapters 2 and 3

Lecture 9
Topic: Case Studies: Pronouns and Classifiers
Readings/Extra Information/Tools: Bond et al. (1995); Bond (2005); Seah and Bond (2014)
Assessment: Lab 4 due; Project 1 due

Lecture 10
Topic: Corpora and Language Engineering
Readings/Extra Information/Tools: Corpus and Text: Basic Principles (John Sinclair 2005): Chapter 1 of Wynne (2005)
Assessment: Project 2 Description

Lecture 11
Topic: Conclusions and Review
Project 2 Presentation: online. Sign in to Teams and be ready at 10:30.
Stubbs 9

Extra Case Studies
Topic: Contrastive and Diachronic Studies
Readings/Extra Information/Tools: Biber, D., S. Conrad & R. Reppen (1998); Xiao, Richard, Lianzhen He and Ming Yue (2008)

Slides may be updated at any time! Labs may also change.

Online Corpus

Recommended Readings

Projects that became papers

Assessment (COR)

Learning Outcomes

On completion of this module, students should be able to:

The raw material for these slides is hosted at https://github.com/bond-lab/Corpus-Linguistics

This course was originally developed at NTU as HG3051: Corpus Linguistics and co-taught with the post-graduate course HG7032: Topics in Corpus Linguistics.


Francis Bond <bond@ieee.org> <francis.bond@upol.cz>
Home page