COR
Lab 4: Designing a Corpus
Due before lecture 9.
Imagine that you have been asked to create a corpus of a particular
language, or a specialized corpus (by register, historical period,
topic, etc.), or to extend an existing corpus with
new information. Write a brief outline of 2–4 pages (any format, but
submit it as a PDF and make sure you include your name) in which you
address the following issues and features and explain why you have made
the decisions that you have.
You do not actually have to build this corpus.
You may wish to make this the basis of the corpus you build or extend in
Project Two.
Upload the final lab report as a PDF.
It should be called lab4-yournamesurname.pdf
- Is it an archive, an electronic text library, a corpus, or a sub-corpus?
- What types of written and/or spoken texts will be in the corpus?
- More specifically, briefly discuss the following characteristics
of your corpus: mode, text origin, constitution, medium, style,
topic, date, and author(s).
- Estimate the cost (time, person-hours) to construct the corpus.
- How will you distribute it to others?
- What types of annotation will there be (tagging, text identification, etc.)?
- More specifically, what information about each text will be included in the header, index, or source files?
- Will it be grammatically tagged? Why or why not?
- How will you handle the following types of text features: (for
written) non-ASCII characters, quotations, lists, headings, proper
names, and pagination; (for spoken) speaker change, syntax,
accent/dialect, interruptions, pauses, and inaudible segments?
- What are some copyright/ethical problems that you might face? How will you deal with these?
- Note that you cannot record someone without getting their permission beforehand:
it is definitely unethical and probably illegal.
- How representative will your corpus be of the entire population (i.e. all possible texts)? What steps will you take to make the corpus representative?
- Who will be the main users of your corpus? What types of information will they likely be looking for?
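As one possible illustration of the header question above, the following minimal Python sketch records the characteristics listed (mode, text origin, constitution, medium, style, topic, date, and authors) for a single text and serializes them as JSON. The class name TextHeader, the text_id field, the field names, and all example values are assumptions made for illustration; they are not a required format for this lab, and you do not need to write any code for it.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class TextHeader:
    """Hypothetical per-text header; fields mirror the characteristics listed above."""
    text_id: str        # assumed identifier scheme, not prescribed by the lab
    mode: str           # e.g. "written" or "spoken"
    text_origin: str    # e.g. where or how the text was produced
    constitution: str   # e.g. "whole text" or "sample"
    medium: str         # e.g. "book", "web page", "broadcast"
    style: str          # e.g. "prose", "dialogue"
    topic: str
    date: str           # ISO 8601 date string, an assumed convention
    authors: list[str] = field(default_factory=list)


def header_to_json(header: TextHeader) -> str:
    """Serialize a header to JSON, keeping non-ASCII characters readable."""
    return json.dumps(asdict(header), ensure_ascii=False, sort_keys=True)


# Invented example values, purely for illustration.
example = TextHeader(
    text_id="demo-0001",
    mode="written",
    text_origin="student essay",
    constitution="whole text",
    medium="web page",
    style="prose",
    topic="travel",
    date="2020-01-31",
    authors=["Zoë Example"],
)

print(header_to_json(example))
```

Whether this information lives in a header inside each file, in a separate index, or in the source files is one of the design decisions your outline should justify.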
COR (Corpus Linguistics) main page.
Francis Bond
<bond@ieee.org>
<francis.bond@upol.cz>