LTI: Language, Technology and the Internet
Francis Bond: 2010, 2012, 2014, 2019, 2020, 2021 (as HG2052), 2023, 2024
Thursday 16:45-18:15, Room 2.39, tř. Svobody 26, 779 00 Olomouc
This course explores the intersection of language, technology and the
internet. We start by looking at the introduction of writing and how
it can be represented on computers. We then look at speech and how it
differs from text. These are compared to different media of
communication such as email, blogs and chat. The internet has made
new methods of writing possible, and has given access to an
incredible variety and amount of text. We study how Wikipedia pages
are written, and students will write their own pages. In
the second half of the course, we learn about how information is
represented electronically, both as text and meta-text. We finish with a discussion of large language models and AI. The implications of this technology for our thinking and understanding of language will also be discussed.
There is no set textbook; all the material is covered in the
lectures. As a result, you need to actually come to the lectures.
General guidelines to the course are given in lecture one.
Course Page: Course Page (here);
Source: Source on GitHub
Course Outline
| Wk | Date | Content (click to download) | Further Reading | Misc |
|---|---|---|---|---|
| 1 | 09-26 | Introduction, Organization: Main Issues | What can search terms tell us?; Which is more efficient: Chinese or English; Rants about technology through the ages | Media Usage Form; Media Usage Diary (sample); Results: 2014, 2019, 2020, 2021, 2023, 2024 |
| 2 | 10-03 | Writing and Text | Sproat (2010) Ch 3 | Introduce Assignment 1; Selected papers from previous years. |
| 3 | 10-10 | Speech and Language Technology | Sproat (2010, Ch 6) | Festival TTS; Mouth Bot; Readspeaker Demo |
| 4 | 10-17 | New Mediums: Email, Usenet, Blogs and Chat | Crystal (2006, Ch 3–6) and some from this class. | Q&A with David Crystal on Internet Linguistics |
| 5 | 10-24 | Wikis and Collaboration | Wikipedia:About | Assignment 2 Ex 0 (12th) Make a wiki account and user page; Multilingual Consent Form |
| 6 | 10-31 | The World Wide Web and HTML | Crystal (2006) Ch 7 | Assignment 1 Due: Nov 8 17:00 CET; Assignment 2 Ex 1: Improve a page Pick your topic |
| | | Reading Week | | |
| 7 | 11-14 | The Web as Corpus | Kilgarriff (2004); Kilgarriff and Grefenstette (2003); Google's Book Search: A Disaster for Scholars (Nunberg, 2009) | |
| 8 | 11-21 | Text, Meta-text and Trust | Marcus, Santorini and Marcinkiewicz (2004); A Gentle Introduction to Metadata (Jeff Good, 2002) | |
| 9 | | Short break | Watch: Simon Willison on AI (PyCon 2024 Keynote) | |
| 10 | 12-05 | AI and Large Language Models I | | Assignment 2 Presentation |
| 11 | 12-12 | AI and Large Language Models II | Practical: Let's try to side quest with AI!; Notes on the new Claude analysis JavaScript code execution tool | |
| 12 | 12-19 | AI and Large Language Models III | | Assignment 2 Due on January 10, 2025 |
| 13 | - | Review and Conclusions | | Assignment 3 Due January 31, 2025 |
Recommended Readings
- Tim Berners-Lee, James Hendler and Ora Lassila (2001).
The semantic web.
Scientific American pages 29-37.
- Sergey Brin and Lawrence Page (1998).
The anatomy of a large-scale hypertextual web search engine.
In Seventh International World-Wide Web Conference (WWW 1998).
- David Crystal (2006). Language and the Internet.
Cambridge University Press, 2nd edition.
- Markus Dickinson, Chris Brew and Detmar Meurers (2013).
Language and Computers. Wiley-Blackwell.
- Adam Kilgarriff and Gregory Grefenstette, editors (2003).
Web as Corpus: Special issue of Computational Linguistics. Vol 29 no 3. ACL.
- Adam Kilgarriff (2004). Web as corpus.
In Sampson and McCarthy (2004), chapter 42,
pages 471–473.
- Andrey Kolmogorov (1968).
Three approaches to the quantitative definition of information.
International Journal of Computer Mathematics.
- Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (2004).
Building a large annotated corpus of English: The Penn treebank.
In Sampson and McCarthy (2004), chapter 21, pages 242–257.
- Chris Manning and Hinrich Schütze (1999).
Foundations of Statistical Natural Language Processing.
MIT Press, Cambridge, MA.
- Geoffrey Sampson and Diana McCarthy, editors (2004).
Corpus Linguistics: Readings in a Widening Discipline. Continuum.
- Shadbolt, N., Hall, W., and Berners-Lee, T. (2006).
The semantic web revisited.
IEEE Intelligent Systems, 21(3), pages 96–101.
- Claude E. Shannon (1948).
A Mathematical Theory of Communication.
Bell System Technical Journal, 27, pp. 379–423 & 623–656, July & October, 1948.
- Richard Sproat (2010). Language, Technology, and Society.
Oxford University Press.
- LANGUAGE@INTERNET "an open-access, peer-reviewed, scholarly electronic journal that publishes original research on language and language use mediated by the Internet, the World Wide Web, and mobile technologies." (Now an example of link rot)
Assessment
- Assignment 1 (20%)
  - Describe a different modality of communication
    and compare it to speech and text
- Assignment 2 (40%)
  - Create or enhance a Wikipedia page about linguistics
  - Using Wikipedia to Re-envision the Term Paper
  - As part of this you will present the page you are planning
    to work on to the class
    - Explain why this subject is notable
    - Show the current state of the page and how
      you plan to improve it
    - Outline the sources you plan to use
    - Aim to present for around 10 minutes
    - Either prepare slides (recommended) or talk over the
      wiki page
    - You do not have to have started changing the page
  - At the end of this assignment the page should be a good page
- Assignment 3 (30%)
  - Create a linguistic resource using LLMs
- Classroom participation (10%)
- Please follow the (Computational) Linguistic Style Guidelines: a guide for the flummoxed
Source code for this course is available at
https://github.com/bond-lab/Language-Technology-and-the-Internet
under a Creative Commons Attribution 4.0 International Licence (CC BY 4.0).
Francis Bond
<bond@ieee.org>
Palacký University