COR: Corpus Linguistics

Francis Bond, 2011, 2012, 2014, 2018, 2020, 2024.

Lecture and Tutorial: Mondays, 15:00 – 16:30, Room 2.39, tř. Svobody 26, 779 00 Olomouc

This course is an introduction to the fast-growing field of corpus linguistics. It aims to familiarise students with key concepts and common methods used in the construction of language corpora, as well as tools that have been developed for searching and using major corpora such as the British National Corpus, the Czech National Corpus, and CJK corpora. Students will be given hands-on experience in pre-editing, annotating, and searching corpora. Criteria and methods used for evaluating corpora and analytical tools will also be discussed. Together, these skills lay the groundwork for research using big data. The main aim of this module is to master the uses of text corpora in linguistics research and data analysis.

Course Content

This course introduces basic corpus skills for linguists:

Marking up linguistic information
Selecting text
The range of existing corpora
How to build your own corpus
Simple database queries with SQL
Using corpora to test linguistic hypotheses
Using corpora to train language tools
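
As a rough illustration of the "simple database queries with SQL" item above, the sketch below counts lemma frequencies with Python's built-in sqlite3 module. The one-table schema (`word` with columns `sid`, `wid`, `lemma`, `pos`) and the toy sentences are assumptions made up for this example; the course databases may be organised differently.

```python
import sqlite3


def build_toy_corpus():
    """Create an in-memory database with a hypothetical one-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE word (sid INTEGER, wid INTEGER, lemma TEXT, pos TEXT)"
    )
    # Two invented sentences, one row per token (sentence id, word id, lemma, tag).
    rows = [
        (1, 1, "the", "DT"), (1, 2, "dog", "NN"), (1, 3, "bark", "VBZ"),
        (2, 1, "the", "DT"), (2, 2, "cat", "NN"), (2, 3, "see", "VBD"),
        (2, 4, "the", "DT"), (2, 5, "dog", "NN"),
    ]
    conn.executemany("INSERT INTO word VALUES (?, ?, ?, ?)", rows)
    return conn


def lemma_frequencies(conn, pos=None):
    """Return (lemma, count) pairs, most frequent first; optionally filter by tag."""
    sql = "SELECT lemma, COUNT(*) AS freq FROM word"
    params = ()
    if pos is not None:
        # A '?' placeholder keeps user input out of the SQL string itself.
        sql += " WHERE pos = ?"
        params = (pos,)
    sql += " GROUP BY lemma ORDER BY freq DESC, lemma ASC"
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    conn = build_toy_corpus()
    for lemma, freq in lemma_frequencies(conn):
        print(lemma, freq)
```

Calling `lemma_frequencies(conn, pos="NN")` restricts the count to tokens tagged as nouns in this toy data.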

Course page (here); source on GitHub

There is no textbook; readings will be assigned each week.

Course Outline

Lecture	Topic	Readings/Extra Information/Tools	Assessment	Fun
1	Basic Concepts, What can we do with Corpora?	Corpus and Text: Basic Principles (Sinclair 2005) in Wynne (2005)
2	Markup and Annotation	Adding Linguistic Annotation (Geoffrey Leech 2005) and Metadata for Corpus Work (Lou Burnard 2005) in Wynne (2005); NTU-MC Tagsets: cmn; eng; jpn; ind; universal; universal (old version)	Lab 1	Phonetic Punctuation (Victor Borge)
3	Representativeness and Balance; Multimodal and Multilingual Corpora	Koehn (2005); Martin et al. (2007); and Character Encoding in Corpus Construction (Anthony McEnery and Richard Xiao 2005) in Wynne (2005)	Discuss the results of the tasks; Lab 1 due; Email corpus choice for Lab 2
4	A survey of Available Corpora	Various Corpora ・Open Korean Corpora ・Balanced Corpus of Contemporary Written Japanese (BCCWJ)	Present Lab 2
5	DIY Corpora, Web as Corpus, Processing Raw Text, SQL	NLTK Chapter 1; sqlitebrowser (DB Browser); SQLite tutorial; NTU Multilingual Corpus; Databases (zipped): English, Chinese, Wordnet, all of them in one zipfile	Lab 3	Bobby Tables
6	Collocation, Frequency, Corpus Statistics	Dunning (1993) Corpus test Wizard Social Science Statistics	Project 1	Passives at the Language Log
7	Encoding, Tokenization + CJK Corpora	Guest Lecture by Martina Jemelková	Lab 3 due; Lab 4 intro
8	Lexical and Grammatical Studies, Variation	Biber et al. (1998) Chapters 2, 3
9	Case Studies: Pronouns and Classifiers	Bond et al. (1995), Bond (2005), Seah and Bond (2014)	Lab 4 due; Project 1 due
10	Corpora and Language Engineering	Corpus and Text: Basic Principles (John Sinclair 2005): Chapter 1 of Wynne (2005)	Project 2 Description
11	Conclusions and Review	Stubbs 9	Project 2 due on May 24th
Extra Case Studies	• Contrastive and Diachronic Studies	Biber, D., S. Conrad & R. Reppen (1998), Xiao, Richard, Lianzhen He and Ming Yue (2008)

Slides may be updated at any time! Labs may also change.

Online Corpus

We will use the English corpora at english-corpora.org, produced by Mark Davies.
They use the CLAWS 7 tagset.
Register a profile using your university email account: www.english-corpora.org/profile_new.asp.
Log in with your account, and select your institution as "Univerzita Palackého v Olomouci".

Assessment (COR)

Individual Lab work (40% = 4 x 10%)
- Lab 1 Searching the BNC
- Lab 2 Describing a Corpus
- Lab 3 Counting with SQL
- Lab 4 Designing/enhancing a Corpus
Project One (Individual: 20%)
Write a 6-page paper (ACL format) using a corpus to describe some linguistic phenomenon quantitatively. The paper must motivate both the choice of phenomenon and the corpus used to study it.
Project Two (Group: 30%)
Choice of one of two tasks
- A program with documentation to perform some substantial corpus processing task
  Deliverable: Program output and evaluation metric along with commented program
- The collection and annotation of a new (sub)corpus
  Deliverable: Annotated corpus with tagging guidelines
- Each task should be accompanied by an 8-page paper (ACL format with extra pages for references) describing your approach
Class Participation (10%)
Extra Credit (up to 5%)
- If you submit a patch that gets accepted to a corpus or tool we use
  - you can get 1-5% extra credit (depending on the size/difficulty)
  - typically 10ⁿ⁻¹ where n is the number of lines you changed
  - you can’t go over 100%
- A patch can involve
  - extending the corpus/code with new capabilities
  - fixing a bug in annotation/code
  - fixing a bug in or extending documentation
  - fixing a spelling error; rewording for clarity; translating to a new language
- The patch has to be for this course (no overlap with URECA, project, HG2051, ...)

Learning Outcomes

On completion of this module, students should be able to:

Understand the uses of text corpora in language research
Manipulate corpora with simple tools
Use a concordance program to extract data from a corpus
Design and build a corpus for some task
Understand how to analyse corpus data through basic statistical methods
Perform simple SQL queries
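
To give a feel for what a concordance program does, here is a minimal keyword-in-context (KWIC) sketch. The function name `kwic`, the whitespace tokenisation, and the sample sentence are assumptions for illustration only; real concordancers such as those used in the labs handle tokenisation, tags, and sorting far more carefully.

```python
def kwic(tokens, keyword, width=3):
    """Return (left, node, right) context tuples for each hit of keyword.

    Matching is case-insensitive; width is the number of tokens of context
    kept on each side of the node word.
    """
    hits = []
    target = keyword.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


if __name__ == "__main__":
    # Toy text, already split into tokens by whitespace.
    text = "The dog barked . The cat saw the dog ."
    for left, node, right in kwic(text.split(), "dog", width=2):
        # Right-align the left context so the node words line up in a column.
        print(f"{left:>15} [{node}] {right}")
```

Aligning the node word in a column, as the print statement does, is what makes recurring patterns to its left and right easy to spot.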

The raw material for these slides is hosted at https://github.com/bond-lab/Corpus-Linguistics

This course was originally developed at NTU as HG3051: Corpus Linguistics and co-taught with the post-graduate course HG7032: Topics in Corpus Linguistics.

Francis Bond <bond@ieee.org> <francis.bond@upol.cz>
Home page