8. Regular Expressions
Lecture notes
Further reading
- Use RegEx101 to test some regular expressions. Make sure you tick 'Python' as the flavor.
  - Put some text in the test string, then try changing the
    regular expression in the top area and see which parts of the
    text are matched. For example, what happens if you add "\b" at
    the beginning of the regular expression? And at the end? Why?
  - You can click substitutions if you want to test substitutions as well.
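The same experiment can be repeated in Python itself. The sketch below is only an illustration: the sample string and the pattern `cat` are made up for this example and are not part of the exercise.

```python
import re

# Hypothetical sample text, chosen so that "cat" appears in four different contexts.
text = "cat concat cats category"

# No boundary: every occurrence of the letters c-a-t is found.
print(re.findall(r"cat", text))        # ['cat', 'cat', 'cat', 'cat']

# \b at the beginning: "cat" must start at a word boundary.
print(re.findall(r"\bcat", text))      # ['cat', 'cat', 'cat']

# \b at the end: "cat" must end at a word boundary.
print(re.findall(r"cat\b", text))      # ['cat', 'cat']

# \b on both sides: only the standalone word is left.
print(re.findall(r"\bcat\b", text))    # ['cat']

# Substitution, as in the RegEx101 substitution panel.
print(re.sub(r"\bcat\b", "dog", text)) # dog concat cats category
```

Note the raw string prefix `r"..."`: without it, Python would read `"\b"` as a backspace character rather than a word boundary.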
- Describe the class of strings matched by the following regular
expressions:
  - [a-zA-Z]+
  - [A-Z][a-z]*
  - \bp[aeiou]{,2}t\b
  - \d+(\.\d+)?
  - ([^aeiou][aeiou][^aeiou])*
  - \w+|[^\w\s]+
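One way to check a description is to test guesses against each pattern. The helper below is a minimal sketch: the name `which_match` and the candidate strings are made up for illustration. It uses `re.fullmatch`, so a candidate counts only if the whole string matches.

```python
import re

def which_match(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Hypothetical candidate strings; swap in your own guesses.
print(which_match(r"[A-Z][a-z]*", ["Hello", "hello", "H", "HeLLo"]))
print(which_match(r"\bp[aeiou]{,2}t\b", ["pt", "pat", "peat", "paeit", "spat"]))
print(which_match(r"\d+(\.\d+)?", ["3", "3.14", "3.", ".5"]))
```

If a string you expected to match is missing from the output (or the reverse), your description of the class needs refining.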
- Write a function extract_dates(text) that takes a string as input and returns a list of all dates in the format dd/mm/yyyy.
text = "My birthday is on 25/12/1990 and my friend's is on 01/01/2000."
extract_dates(text) # Output: ['25/12/1990', '01/01/2000']
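One possible approach, given as a sketch rather than the expected answer, uses `re.findall`. It assumes that "dd/mm/yyyy" means exactly two digits, two digits and four digits separated by slashes. It does not check that the day and month are valid, so 99/99/9999 would also be returned.

```python
import re

def extract_dates(text):
    """Return all substrings shaped like dd/mm/yyyy (the shape only, not validity)."""
    # \b keeps the pattern from matching inside a longer run of digits.
    return re.findall(r"\b\d{2}/\d{2}/\d{4}\b", text)

text = "My birthday is on 25/12/1990 and my friend's is on 01/01/2000."
print(extract_dates(text))  # ['25/12/1990', '01/01/2000']
```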
- Write a function normalize_spaces(text) that takes a string with multiple spaces between words and replaces them with a single space. Return the modified string.
text = "This   is  an    example   sentence."
normalize_spaces(text) # Output: "This is an example sentence."
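A minimal sketch with `re.sub`, assuming that only the space character should be collapsed, so tabs and newlines are left alone:

```python
import re

def normalize_spaces(text):
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", text)

text = "This   is  an    example   sentence."
print(normalize_spaces(text))  # This is an example sentence.
```

Using `\s+` instead of ` {2,}` would also collapse tabs and line breaks, which may or may not be what you want.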
- Write a function mask_digits(text) that takes a string and replaces every digit with the # symbol. Return the modified string.
text = "My phone number is 123-456-7890."
mask_digits(text) # Output: "My phone number is ###-###-####."
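A possible sketch, again with `re.sub`. Note that in Python 3, `\d` on a `str` pattern also matches digits from other scripts, such as full-width digits. Use `[0-9]` if only ASCII digits should be masked.

```python
import re

def mask_digits(text):
    """Replace each digit with '#', one '#' per digit."""
    return re.sub(r"\d", "#", text)

text = "My phone number is 123-456-7890."
print(mask_digits(text))  # My phone number is ###-###-####.
```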
- Write a function count_word_occurrences(text, word) that takes a string text and a string word, and returns the number of times word appears as a standalone word (i.e., surrounded by word boundaries) in the text. The match should be case-insensitive.
>>> count_word_occurrences("This is an example. This is fun!", "this")
2
>>> count_word_occurrences("pen, pineapple, apple-pen and pineapple-pen", "apple")
1
>>> count_word_occurrences("Nothing to see here.", "banana")
0
- Hint: use the special metacharacter \b, which matches the boundary between a word character and a non-word character (described here)
- If you need a break, return to 2016 and watch PPAP (Pen-Pineapple-Apple-Pen).
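Following the hint, one way to sketch count_word_occurrences is shown below. The call to `re.escape` is an extra precaution not mentioned in the exercise: it makes sure that characters such as `.` or `+` in word are treated literally.

```python
import re

def count_word_occurrences(text, word):
    """Count case-insensitive occurrences of word as a standalone word."""
    # \b on both sides requires a word boundary before and after the word.
    pattern = rf"\b{re.escape(word)}\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

print(count_word_occurrences("This is an example. This is fun!", "this"))                   # 2
print(count_word_occurrences("pen, pineapple, apple-pen and pineapple-pen", "apple"))       # 1
print(count_word_occurrences("Nothing to see here.", "banana"))                             # 0
```

The hyphen in "apple-pen" is a non-word character, so there is a boundary after "apple" there. Inside "pineapple" there is no boundary before "apple", so it is not counted.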
Practical work --- in class (code, output)
Further Practice --- at home (code, output)
- In Chinese, titles of books are often written between
full-width double angle brackets: 《...》. If there is a translation,
it will often follow in plain parentheses,
e.g., 《Pride and Prejudice》 (傲慢与偏见).
Write a function get_trans()
that takes text as input and gives a list of [(title, translation), ...]
as output.
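A minimal sketch of get_trans() with two capture groups. It assumes the parentheses are the ASCII characters ( and ), with optional whitespace after 》. Full-width parentheses （ ） are not handled, and titles without a following translation are skipped.

```python
import re

# Group 1: text between 《 and 》; group 2: text between ( and ).
TITLE_TRANS = re.compile(r"《([^》]+)》\s*\(([^)]+)\)")

def get_trans(text):
    """Return a list of (title, translation) pairs found in text."""
    # With two groups, findall returns a list of 2-tuples.
    return TITLE_TRANS.findall(text)

print(get_trans("《Pride and Prejudice》 (傲慢与偏见)"))
# [('Pride and Prejudice', '傲慢与偏见')]
```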
- In Japanese, a lot of onomatopoeia (sound symbolism) takes the
form of two morae repeated, like ピカピカ "pika-pika". Write a
function that looks for examples like that, and test it with:
"空がピカピカ光っていたし、心臓がドキドキしていました。彼はニコニコ笑っていました。"
- Are there any examples in the Japanese duowiki data: https://bond-lab.github.io/Language-and-the-Computer/code/duowiki/vocab-Japanese.txt?
Download it and see, ...
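One way to approach the reduplication exercise is a backreference: capture two kana and require the same two again with `\1`. This is a rough sketch under a simplifying assumption, namely that one kana character counts as one mora. That is not always true: キャ is two characters but one mora. The function name `find_reduplications` is made up for this example.

```python
import re

# Two kana (hiragana or katakana, including the long-vowel mark ー),
# followed immediately by the same two characters again.
REDUP = re.compile(r"([ぁ-ゖァ-ヴー]{2})\1")

def find_reduplications(text):
    """Return all four-character kana sequences of the form XYXY."""
    # finditer + group(0) gives the whole match, not just the captured half.
    return [m.group(0) for m in REDUP.finditer(text)]

text = "空がピカピカ光っていたし、心臓がドキドキしていました。彼はニコニコ笑っていました。"
print(find_reduplications(text))  # ['ピカピカ', 'ドキドキ', 'ニコニコ']
```

The same function can be applied to each line of a downloaded vocabulary file. Expect some false positives, since not every XYXY sequence is sound-symbolic.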
LAC: Language and the Computer Francis Bond.