COR: Corpus Linguistics
Francis Bond,
2011, 2012, 2014, 2018, 2020, 2024.
This course is an introduction to the fast-growing field of corpus
linguistics. It aims to familiarise students with key concepts and
common methods used in the construction of language corpora, as well
as tools that have been developed for searching and using major
corpora such as the British National Corpus, Czech National Corpus,
and CJK corpora. Students will be given hands-on experience in
pre-editing, annotating, and searching corpora. Criteria and methods
used for evaluating corpora and analytical tools will also be
discussed. This lays the groundwork for research using big data.
The main aim of this module is to master the uses of text corpora in linguistics research and data analysis.
Course Content
This course introduces basic corpus skills for linguists:
- Marking up linguistic information
- Selecting text
- The range of existing corpora
- How to build your own corpus
- Simple database queries with SQL
- Using corpora to test linguistic hypotheses
- Using corpora to train language tools
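As a taste of the "simple database queries with SQL" item, here is a minimal sketch using Python's built-in sqlite3 module. The table name `token`, its columns, and the toy sentences are assumptions made up for illustration; they are not the schema of the course databases.

```python
import sqlite3

# Hypothetical toy data: one row per token (sentence id, word form, POS tag).
TOKENS = [
    (1, "the", "DET"), (1, "cat", "NOUN"), (1, "sat", "VERB"),
    (2, "the", "DET"), (2, "dog", "NOUN"), (2, "saw", "VERB"),
    (2, "the", "DET"), (2, "cat", "NOUN"),
]


def build_corpus(rows):
    """Load (sid, word, pos) rows into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE token (sid INTEGER, word TEXT, pos TEXT)")
    conn.executemany("INSERT INTO token VALUES (?, ?, ?)", rows)
    return conn


def word_frequencies(conn, pos=None):
    """Return (word, count) pairs, most frequent first, optionally for one POS."""
    query = "SELECT word, COUNT(*) AS freq FROM token"
    params = ()
    if pos is not None:
        # A bound parameter (?) keeps user input out of the SQL string.
        query += " WHERE pos = ?"
        params = (pos,)
    query += " GROUP BY word ORDER BY freq DESC, word ASC"
    return conn.execute(query, params).fetchall()


conn = build_corpus(TOKENS)
all_freqs = word_frequencies(conn)
noun_freqs = word_frequencies(conn, pos="NOUN")
print(all_freqs)
print(noun_freqs)
```

The same `GROUP BY` and `COUNT(*)` pattern can be typed directly into a tool such as DB Browser; only the table and column names would differ.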
Course page (here);
Source on GitHub
There is no textbook; readings will be assigned each week.
Course Outline
| Lecture | Topic | Readings/Extra Information/Tools | Assessment | Fun |
|---|---|---|---|---|
| 1 | Basic Concepts, What can we do with Corpora? | Corpus and Text: Basic Principles (Sinclair 2005) in Wynne (2005) | | |
| 2 | Markup and Annotation | Adding Linguistic Annotation (Geoffrey Leech 2005) and Metadata for Corpus Work (Lou Burnard 2005) in Wynne (2005); NTU-MC Tagsets: cmn; eng; jpn; ind; universal; universal (old version) | Lab 1 | Phonetic Punctuation, Victor Borge |
| 3 | Representativeness and Balance; Multimodal and Multilingual Corpora | Koehn (2005); Martin et al (2007) and Character Encoding in Corpus Construction (Anthony McEnery and Richard Xiao 2005) in Wynne (2005) | Discuss the results of the tasks; Lab 1 due; Email corpus choice for Lab 2 | |
| 4 | A Survey of Available Corpora | Various Corpora: Open Korean Corpora; Balanced Corpus of Contemporary Written Japanese (BCCWJ) | Present Lab 2 | |
| 5 | DIY Corpora, Web as Corpus, Processing Raw Text, SQL | NLTK Chapter 1; sqlitebrowser (DB Browser); SQLite tutorial; NTU Multilingual Corpus; Databases (zipped): English, Chinese, Wordnet, All of them in one zipfile | Lab 3 | Bobby Tables |
| 6 | Collocation, Frequency, Corpus Statistics | Dunning (1993); Corpus test Wizard; Social Science Statistics | Project 1 | Passives at the Language Log |
| 7 | Encoding, tokenization + CJK Corpora | Guest Lecture by Martina Jemelková | Lab 3 due; Lab 4 intro | |
| 8 | Lexical and Grammatical Studies, Variation | Biber et al. (1998) Chapters 2, 3 | | |
| 9 | Case studies: Pronouns and Classifiers | Bond et al. (1995), Bond (2005), Seah and Bond (2014) | Lab 4 due; Project 1 due | |
| 10 | Corpora and Language Engineering | Corpus and Text: Basic Principles (John Sinclair 2005): Chapter 1 of Wynne (2005) | Project 2 Description | |
| 11 | Conclusions and Review | Stubbs 9 | Project 2 due on May 24th | |
| Extra Case Studies | Contrastive and Diachronic Studies | Biber, D., S. Conrad & R. Reppen (1998); Xiao, Richard, Lianzhen He and Ming Yue (2008) | | |
Slides may be updated at any time! Labs may also change.
Online Corpus
Recommended Readings
- Biber, D., S. Conrad & R. Reppen (1998), Corpus Linguistics: Investigating Language Structure and Use. Cambridge University Press.
- Bond, Francis (2005) Translating the Untranslatable: A Solution to the Problem of Generating English Determiners. CSLI Studies in Computational Linguistics, CSLI Publications, Stanford.
- Bond, Francis, Kentaro Ogura and Satoru Ikehara (1995a) Possessive pronouns as determiners in Japanese-to-English machine translation. In 2nd Pacific Association for Computational Linguistics Conference: PACLING-95, 32–38, Brisbane. (cmp-lg/9601006).
- Steven Bird, Ewan Klein, Edward Loper (2009) Natural Language Processing with Python, O'Reilly. NLTK Natural Language Toolkit
- Dunning, Ted (1993) Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1), March 1993, 61-74.
- Nancy Ide, Catherine Macleod (2001). The American National Corpus: A Standardized Resource of American English. Proceedings of Corpus Linguistics, Lancaster UK.
- Kennedy, G. An Introduction to Corpus Linguistics. Longman, 1998.
- Wynne, Martin (editor). 2005. Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books. [Copied from https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2951 on 2020-01-02, under the license: (CC BY-NC-SA 3.0)].
- Koehn, Philipp (2005) Europarl: A Parallel Corpus for Statistical Machine Translation, MT Summit 2005
- Leech, Geoffrey and Nicholas Smith, Manual to accompany The British National Corpus (Version 2) with Improved Word-class Tagging. University of Lancaster, 2000. Retrieved 2011-02-04 from http://ucrel.lancs.ac.uk/bnc2/bnc2postag_manual.htm.
- Martin, Jean-Claude, Patrizia Paggio, Peter Kuehnlein, Rainer Stiefelhagen and Fabio Pianesi (2007) Introduction to the special issue on Multimodal Corpora for Modeling Human Multimodal Behaviour. Language Resources and Evaluation, vol. 41, no. 3-4, December 2007.
- McEnery, Tony et al (2006) Corpus-Based Language Studies: An Advanced Resource Book. Routledge.
- McEnery, Tony and Andrew Wilson (2001) Corpus Linguistics 2nd ed, Edinburgh UP.
- Paul Newman (2007) Copyright Essentials for Linguists. Language Documentation & Conservation 1(1).
- Sinclair, John (1991) Corpus, Concordance, Collocation. Oxford: Oxford UP.
- Seah, Yu Jie and Francis Bond (2014) Annotation of Pronouns in a Multilingual Corpus of Mandarin Chinese, English and Japanese. In 10th Joint ACL - ISO Workshop on Interoperable Semantic Annotation, Reykjavik.
- Stubbs, Michael (1996) Text and Corpus Analysis. Blackwell Publishers.
- Xiao, Richard, Lianzhen He and Ming Yue (2008) Proceedings of The International Symposium on Using Corpora in Contrastive and Translation Studies (UCCTS 2008), Alberta.
Projects that became papers
Assessment (COR)
- Individual Lab work (40% = 4 x 10%)
  - Lab 1 Searching the BNC
  - Lab 2 Describing a Corpus
  - Lab 3 Counting with SQL
  - Lab 4 Designing/enhancing a Corpus
- Project One (Individual: 20%)
  Write a 6-page paper (ACL format) using a corpus to describe some linguistic phenomenon quantitatively. The paper must motivate both the choice of phenomenon and the corpus used to study it.
- Project Two (Group: 30%)
  - Choice of one of two tasks:
    - A program with documentation to perform some substantial corpus processing task
      Deliverable: program output and evaluation metric along with commented program
    - The collection and annotation of a new (sub)corpus
      Deliverable: annotated corpus with tagging guidelines
  - Each task should be accompanied by an 8-page paper (ACL format with extra pages for references) describing your approach
- Class Participation (10%)
- Extra Credit (up to 5%)
  - If you submit a patch that gets accepted to a corpus or tool we use, you can get 1-5% extra credit (depending on the size/difficulty)
    - typically 10n−1 where n is the number of lines you changed
    - you can't go over 100%
  - A patch can involve
    - extending the corpus/code with new capabilities
    - fixing a bug in annotation/code
    - fixing a bug in or extending documentation
    - fixing a spelling error; rewording for clarity; translating to a new language
  - Has to be for this course (no overlap with URECA, project, HG2051, ...)
Learning Outcomes
On completion of this module, students should be able to:
- Understand the uses of text corpora in language research
- Manipulate corpora with simple tools
- Use a concordance program to extract data from a corpus
- Design and build a corpus for some task
- Understand how to analyse corpus data through basic statistical methods
- Perform simple SQL queries
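To give a flavour of the "basic statistical methods" outcome, the sketch below computes the log-likelihood ratio statistic (G²) for a 2x2 contingency table, the kind of association measure discussed in Dunning (1993) for collocations. The function name, the cell layout, and the example counts are assumptions chosen for illustration, not code from the course materials.

```python
import math


def log_likelihood(k11, k12, k21, k22):
    """G2 for a 2x2 table of observed counts.

    Assumed layout for a candidate bigram "w1 w2":
      k11 = w1 followed by w2       k12 = w1 followed by another word
      k21 = another word, then w2   k22 = neither w1 nor w2
    """
    observed = [[k11, k12], [k21, k22]]
    total = k11 + k12 + k21 + k22
    row = [k11 + k12, k21 + k22]
    col = [k11 + k21, k12 + k22]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # an empty cell contributes nothing to the sum
                expected = row[i] * col[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# Hypothetical counts: the pair co-occurs 30 times in 10,000 bigrams.
score = log_likelihood(30, 70, 70, 9830)
print(round(score, 2))
```

A table whose cells match the counts expected under independence scores 0; larger values indicate a stronger association between the two words.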
The raw material for these slides is hosted at https://github.com/bond-lab/Corpus-Linguistics
This course was originally developed at NTU as HG3051: Corpus
Linguistics and co-taught with the post-graduate
course HG7032: Topics in Corpus Linguistics.
Francis Bond
<bond@ieee.org>
<francis.bond@upol.cz>
Home page