Developing Linguistic Corpora: a Guide to Good Practice
Preface
Martin Wynne
A linguistic corpus is a
collection of texts which
have been selected and
brought together so that
language can be studied on
the computer. Today, corpus
linguistics offers some of
the most powerful new
procedures for the analysis
of language, and the impact
of this dynamic and
expanding sub-discipline is
making itself felt in many
areas of language study.
In this volume, a selection
of leading experts in
various key areas of corpus
construction offer advice in
a readable and largely
non-technical style to help
the reader to ensure that
their corpus is well
designed and fit for the
intended purpose.
This Guide is aimed at those
who are at some stage of
building a linguistic
corpus. Little or no
knowledge of corpus
linguistics or computational
procedures is assumed,
although it is hoped that
more advanced users will
also find the guidelines
here useful. It also has
relevance for those who are
not building a corpus, but
who need to know something
about the issues involved in
the design of corpora in
order to choose between
available resources and to
help draw conclusions from
their analysis.
Increasing numbers of
researchers are seeing the
potential benefits of the
use of an electronic corpus
as a source of empirical
language data for their
research. Until now, where
could they find out how
to build a corpus? There is
a great deal of useful
information available which
covers principles of corpus
design and development, but
it is dispersed in
handbooks, reports,
monographs, journal articles
and sometimes only in the
heads of experienced
practitioners. This Guide is
an attempt to draw together
the experience of corpus
builders into a single
source, as a starting point
for obtaining advice and
guidance on good practice in
this field. It aims to bring
together some key elements
of the experience gained
over many decades by
leading practitioners in the
field and to make it
available to those
developing corpora today.
The modest aim of this Guide
is to take readers through
the basic first steps
involved in creating a
corpus of language data in
electronic form for the
purpose of linguistic
research. While some
technical issues are
covered, this Guide does not
aim to offer the latest
information on digitisation
techniques. Rather, the
emphasis is on the
principles, and readers are
invited to refer to other
sources, such as current
AHDS information papers, for
up-to-date advice on
technologies. In addition to
the first chapter on the
principles of corpus design,
Professor Sinclair has also
provided a more practical
guide to building a corpus,
which is added as an
appendix to the Guide. This
should help guide the user
through some of the more
specific decisions that are
likely to be involved in
building a corpus.
Alert readers will see that
there are areas where the
authors are not in accord
with each other. It is for
the reader to weigh up the
advantages of each approach
for their own particular
project, and to decide which
course to follow. This Guide
does not aim to synthesise the
advice offered by the
various practitioners into a
single approach to creating
corpora. The information on
good practice which is
sampled here comes from a
variety of sources,
reflecting different
research goals, intellectual
traditions and theoretical
orientations. The individual
authors were asked to state
their opinion on what they
think is the best way to
deal with the relevant
aspects of developing a
corpus, and neither the
authors nor the editor has
tried to hide the
differences in approaches
which inevitably exist. It
is anticipated that readers
of this document will have
differing backgrounds, will
have very diverse aims and
objectives, will be dealing
with a variety of different
languages and varieties, and
that one single approach
would not fit them all.
I would like to thank the
authors of this volume for
their goodwill and support
for this venture, and for
their patience through the
long period it has taken to
bring the Guide to
publication. I would like to
acknowledge the extremely
helpful advice and editorial
work from my colleague Ylva
Berglund, which has improved
many aspects of this
Guide.