LTI: Language, Technology and the Internet
Francis Bond: 2010, 2012, 2014, 2019, 2020, 2021 (as HG2052), 2023
Wednesday 09:45-11:15, Room 2.40, tř. Svobody 26, 779 00 Olomouc
This course explores the intersection of language, technology and the
internet. We start by looking at the introduction of writing and how
it can be represented on computers. We then look at speech and how it
differs from text. These are compared to different media of
communication such as email, blogs and chat. The internet has made
new methods of writing possible, and has given access to an
incredible variety and amount of text. We study how Wikipedia pages
are written, and students group together to write their own pages. In
the second half of the course, we learn about how information is
represented electronically, both as text and meta-text. We finish with
a discussion of large language models and AI, and of what this
technology implies for how we think about and understand language.
There is no set textbook; all the material is covered in the
lectures. As a result, you need to actually come to the lectures.
General guidelines for the course are given in lecture one.
Course Page: Course Page (here);
Source: Source on GitHub
Course Outline
| Week | Content | Further Reading | Misc |
|---|---|---|---|
| 1 | Introduction, Organization: Main Issues | What can search terms tell us?; Which is more efficient: Chinese or English; Rants about technology through the ages | Media Usage Form; Media Usage Diary (sample); Results: 2014, 2019, 2020, 2021, 2023 |
| 2 | Writing and Text | Sproat (2010, Ch 3) | Introduce Assignment 1; Selected papers from previous years |
| 3 | Speech and Language Technology | Sproat (2010, Ch 6) | Festival TTS; Mouth Bot; Readspeaker Demo |
| 4 | New Mediums: Email, Usenet, Blogs and Chat | Crystal (2006, Ch 3–6) and some from this class | Q&A with David Crystal on Internet Linguistics |
| 5 | Wikis and Collaboration | Wikipedia:About | Assignment 2 Ex 0 (12th): Make a wiki account and user page; Multilingual Consent Form |
| 6 | The World Wide Web and HTML | Crystal (2006, Ch 7) | Assignment 1 Due: Oct 27 17:00; Assignment 2 Ex 1: Improve a page; Pick your group and topic |
| 7 | The Web as Corpus | Kilgarriff (2004); Kilgarriff and Grefenstette (2003); Google's Book Search: A Disaster for Scholars (Nunberg, 2009) | |
| 8 | Text and Meta-text | Marcus, Santorini and Marcinkiewicz (2004); A Gentle Introduction to Metadata (Jeff Good, 2002) | Assignment 2 Presentation Nov 8 |
| 9 | Language Identification and Normalization | Manning and Schütze (1999, Ch 3); Generalized Language Identification (Marco Lui, 2014) | |
| 10 | Citation, Reputation and PageRank | Brin and Page (1998) | |
| 11 | AI and Large Language Models I | | Assignment 2 Due on Friday Dec 1 |
| 12 | AI and Large Language Models II | | |
| 13 | Review and Conclusions | | Assignment 3 Due January 25th (midnight) |
Recommended Readings
- Tim Berners-Lee, James Hendler and Ora Lassila (2001). The semantic web. Scientific American, pages 29–37.
- Sergey Brin and Lawrence Page (1998). The anatomy of a large-scale hypertextual web search engine. In Seventh International World-Wide Web Conference (WWW 1998).
- David Crystal (2006). Language and the Internet. Cambridge University Press, 2nd edition.
- Markus Dickinson, Chris Brew and Detmar Meurers (2013). Language and Computers. Wiley-Blackwell.
- Adam Kilgarriff and Gregory Grefenstette, editors (2003). Web as Corpus: Special issue of Computational Linguistics. Vol 29 no 3. ACL.
- Adam Kilgarriff (2004). Web as corpus. In Sampson and McCarthy (2004), chapter 42, pages 471–473.
- Andrey Kolmogorov (1968). Three approaches to the quantitative definition of information. International Journal of Computer Mathematics.
- Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (2004). Building a large annotated corpus of English: The Penn treebank. In Sampson and McCarthy (2004), chapter 21, pages 242–257.
- Chris Manning and Hinrich Schütze (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.
- Geoffrey Sampson and Diana McCarthy, editors (2004). Corpus Linguistics: Readings in a Widening Discipline. Continuum.
- Shadbolt, N., Hall, W., and Berners-Lee, T. (2006). The semantic web revisited. IEEE Intelligent Systems, pages 1541–1672.
- Claude E. Shannon (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27, pp. 379–423 & 623–656, July & October, 1948.
- Richard Sproat (2010). Language, Technology, and Society. Oxford University Press.
- LANGUAGE@INTERNET "an open-access, peer-reviewed, scholarly electronic journal that publishes original research on language and language use mediated by the Internet, the World Wide Web, and mobile technologies." (Now an example of link rot)
Assessment
- Assignment 1 (20%)
  - Describe a different modality of communication and compare it to speech and text
- Assignment 2 (40%) Group Work
  - Create or enhance a Wikipedia page about linguistics
  - Using Wikipedia to Re-envision the Term Paper
  - As part of this you will present the page you are planning to work on to the class
    - Explain why this subject is notable
    - Show the current state of the page and how you plan to improve it
    - Outline the sources you plan to use
    - Aim to present for around 10 minutes
    - Any subset of the group can present
    - Either prepare slides (recommended) or talk over the wiki page
    - You do not have to have started changing the page
  - At the end of this assignment the page should be a good page
- Assignment 3 (30%)
  - Create a linguistic resource using LLMs
- Classroom participation (10%)
- Please follow the (Computational) Linguistic Style Guidelines: a guide for the flummoxed
Source code for this course is available at
https://github.com/bond-lab/Language-Technology-and-the-Internet
under a Creative Commons Attribution 4.0 International Licence (CC BY 4.0).
Francis Bond
<bond@ieee.org>
Palacký University