8. Regular Expressions

Lecture notes

Do the Interactive regular expression tutorial
Then read Wikiversity Python Programming/RegEx to see how to use them in Python

Before class (code, output)

Use RegEx101 to test some regular expressions. Make sure you tick 'Python' as the flavor.
- Put some text in the test string area, then try changing the regular expression in the top area and see which parts of the text are matched. For example, what happens if you add "\b" at the beginning of the regular expression? And at the end? Why?
- You can click substitutions, if you want to test substitutions as well.
Describe the class of strings matched by the following regular expressions:
1. [a-zA-Z]+
2. [A-Z][a-z]*
3. \bp[aeiou]{,2}t\b
4. \d+(\.\d+)?
5. ([^aeiou][aeiou][^aeiou])*
6. \w+|[^\w\s]+
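One way to explore these patterns outside RegEx101 is a small probing helper. The sketch below is only an illustration: the name `probe` and the candidate strings are made up for this example. It uses `re.fullmatch`, so a candidate counts only if the whole string matches the pattern.

```python
import re

def probe(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [s for s in candidates if compiled.fullmatch(s)]

# Try a pattern against strings you expect to match and strings you do not.
print(probe(r"[A-Z][a-z]*", ["Hello", "hello", "H", "HeLLo"]))
print(probe(r"\d+(\.\d+)?", ["42", "3.14", "3.", ".5"]))
```

Comparing what you predicted with what comes back is a quick check on your description of each class of strings.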
Write a function extract_dates(text) that takes a string as input and returns a list of all dates in the format dd/mm/yyyy.
text = "My birthday is on 25/12/1990 and my friend's is on 01/01/2000."
extract_dates(text)
# Output: ['25/12/1990', '01/01/2000']
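A minimal sketch of one possible approach, assuming that "dd/mm/yyyy" means exactly two digits, two digits, and four digits separated by slashes. It does not check that the day and month are valid calendar values.

```python
import re

# Two digits, slash, two digits, slash, four digits, delimited by word boundaries.
DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")

def extract_dates(text):
    """Return every dd/mm/yyyy-shaped substring in text, in order."""
    return DATE_PATTERN.findall(text)

text = "My birthday is on 25/12/1990 and my friend's is on 01/01/2000."
print(extract_dates(text))
```

The `\b` anchors keep the pattern from picking a date out of a longer run of digits such as 125/12/19901.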
Write a function normalize_spaces(text) that takes a string with multiple spaces between words and replaces them with a single space. Return the modified string.
text = "This   is    an  example     sentence."
normalize_spaces(text)
# Output: "This is an example sentence."
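A possible sketch using `re.sub`, assuming that only runs of the ordinary space character need collapsing. Tabs and newlines are left alone; use `\s+` instead if those should be collapsed too.

```python
import re

def normalize_spaces(text):
    """Replace each run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", text)

print(normalize_spaces("This   is    an  example     sentence."))
```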
Write a function mask_digits(text) that takes a string and replaces every digit with the # symbol. Return the modified string.
text = "My phone number is 123-456-7890."
mask_digits(text)
# Output: "My phone number is ###-###-####."
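One way this could look, as a sketch. Note that in Python 3, `\d` on a `str` pattern also matches digits from other scripts (for example full-width digits); the class `[0-9]` restricts the match to ASCII digits if that is what you want.

```python
import re

def mask_digits(text):
    """Replace every digit in text with the # symbol."""
    return re.sub(r"\d", "#", text)

print(mask_digits("My phone number is 123-456-7890."))
```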
Write a function count_word_occurrences(text, word) that takes a string text and a string word, and returns the number of times word appears as a standalone word (i.e., surrounded by word boundaries) in the text. The match should be case-insensitive.
>>> count_word_occurrences("This is an example. This is fun!", "this")
2
>>> count_word_occurrences("pen, pineapple, apple-pen and pineapple-pen", "apple")
1
>>> count_word_occurrences("Nothing to see here.", "banana")
0
- Hint: use the special metacharacter \b, which matches the boundary between a word character and a non-word character.
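A sketch of how the hint might be applied. The word is passed through `re.escape` so that any regex metacharacters in it are treated literally; that step is an addition for safety and is not required by the exercise.

```python
import re

def count_word_occurrences(text, word):
    """Count case-insensitive occurrences of word as a standalone word."""
    # \b on both sides rejects matches inside longer words such as "pineapple".
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

print(count_word_occurrences("This is an example. This is fun!", "this"))
print(count_word_occurrences("pen, pineapple, apple-pen and pineapple-pen", "apple"))
```

The hyphen is a non-word character, so "apple" in "apple-pen" counts as standalone, while the "apple" inside "pineapple" does not.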
If you need a break, return to 2016 and watch PPAP (Pen-Pineapple-Apple-Pen).

Practical work --- in class (code, output)

Rewrite swear_filter so that it uses regular expression substitution (re.sub) instead of the string method replace.
Use word boundary matching and the ignore-case flag to make it more robust.
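A rough sketch of what this could look like. The original swear_filter is not shown on this page, so the signature here (a text and a list of banned words) and the choice to mask with asterisks are assumptions made for illustration.

```python
import re

def swear_filter(text, banned):
    """Mask each banned word with asterisks, ignoring case."""
    # Build one alternation of the escaped words, wrapped in word boundaries.
    alternatives = "|".join(re.escape(word) for word in banned)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    # Replace each match with as many asterisks as it has characters.
    return pattern.sub(lambda m: "*" * len(m.group(0)), text)

print(swear_filter("Darn it, that DARN darning needle!", ["darn"]))
```

With plain `str.replace`, "darning" would be mangled and "DARN" would be missed; the boundaries and the flag address both problems.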
Take a look at the following (from the Korean Duowiki):
```
    
안녕 (annyeong) = hi/bye (informal)
안녕하세요 (annyeonghaseyo) = hello (polite)
안녕하십니까 (anyeonghasimnikka) = hello (formal)
만나서 반갑습니다 (mannaseo bangapseumnida) = nice to meet you
저 (jeo) = I, me
제 (je) = my = 저의
  
```
- Use regular expressions to extract as much information as you can!
- Break different tasks down into different functions.
- In real-world data, there will often be patterns that only appear a few times (or even just once).
  In these cases, you can choose to ignore them and discard that data.
  Normally we would keep a record of what we throw away, typically in a file called something.log.
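As a starting point, the sketch below parses lines shaped like the sample above into fields and sets aside anything that does not fit. The field names (`hangul`, `roman`, `gloss`, `note`) and the function name `parse_vocab` are invented for this example, and the pattern assumes the layout "hangul (romanization) = gloss (optional note)".

```python
import re

ENTRY = re.compile(
    r"^(?P<hangul>[가-힣 ]+?)\s*"        # Korean text, possibly several words
    r"\((?P<roman>[a-z ]+)\)\s*=\s*"     # romanization in parentheses, then "="
    r"(?P<gloss>.+?)\s*"                 # English gloss
    r"(?:\((?P<note>[^)]+)\))?$"         # optional trailing note, e.g. (formal)
)

def parse_vocab(lines):
    """Split lines into parsed entries and a list of rejected lines."""
    entries, rejected = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = ENTRY.match(line)
        if match:
            entries.append(match.groupdict())
        else:
            rejected.append(line)  # keep a record of what is thrown away
    return entries, rejected

sample = [
    "안녕 (annyeong) = hi/bye (informal)",
    "만나서 반갑습니다 (mannaseo bangapseumnida) = nice to meet you",
    "제 (je) = my = 저의",
    "??? broken line",
]
entries, rejected = parse_vocab(sample)
print(entries)
print(rejected)
```

The rejected list is what would be written to a log file. A gloss such as "my = 저의" is kept whole here; splitting it further would be a separate function.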

Further Practice --- at home (code, output)

In Chinese, titles of books are often written between full-width double angle brackets: 《...》. If there is a translation, it will often be between simple parentheses.
e.g., 《Pride and Prejudice》 (傲慢与偏见)
Write a function get_trans() that takes text as input and gives a list of [(title, translation), ...] as output.
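A minimal sketch, assuming the translation follows the title directly (with optional whitespace) and that either ASCII or full-width parentheses may be used. Accepting full-width parentheses is an assumption beyond what the exercise states.

```python
import re

# Title inside 《...》, then a parenthesised translation.
TITLE_PATTERN = re.compile(r"《([^》]+)》\s*[（(]([^）)]+)[）)]")

def get_trans(text):
    """Return a list of (title, translation) pairs found in text."""
    return TITLE_PATTERN.findall(text)

print(get_trans("e.g., 《Pride and Prejudice》 (傲慢与偏见)"))
```

Because the pattern has two capturing groups, `findall` returns a list of two-element tuples. Titles with no translation are skipped by this version.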
In Japanese, a lot of onomatopoeia (sound symbolism) takes the form of two morae repeated, like ピカピカ "pika-pika". Write a function that looks for examples like that, and test it with:
"空がピカピカ光っていたし、心臓がドキドキしていました。彼はニコニコ笑っていました。"
Are there any examples in the Japanese duowiki data: https://bond-lab.github.io/Language-and-the-Computer/code/duowiki/vocab-Japanese.txt?
Download it and see, ...
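One possible sketch uses a backreference: capture two kana, then require the same two again with `\1`. It treats one kana character as one mora, which is only an approximation, since a mora such as キャ is written with two characters. The function name `find_reduplication` is made up for this example.

```python
import re

# Two hiragana or katakana characters, immediately repeated.
REDUP = re.compile(r"([ぁ-ゖァ-ヶー]{2})\1")

def find_reduplication(text):
    """Return all four-kana stretches of the form XYXY found in text."""
    return [m.group(0) for m in REDUP.finditer(text)]

sentence = "空がピカピカ光っていたし、心臓がドキドキしていました。彼はニコニコ笑っていました。"
print(find_reduplication(sentence))
```

The same function can be applied line by line to the downloaded vocabulary file to see whether it contains any such words.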

LAC: Language and the Computer Francis Bond.
