12. Summary and Review

For revision: reread chapters 1-6, redo the tutorial problems.

NLTK Chapters: 1-6

Sequences

  1. Strings, lists and tuples are different kinds of sequence object, supporting common operations such as indexing, slicing, finding the length, sorting, and membership testing using in.
  2. Texts can be represented in Python using lists: ['Monty', 'Python']. We can use indexing, slicing, and the len() function on lists.
  3. A string is specified in Python using single or double quotes: 'Monty Python', "Monty Python".
  4. Strings have some special tests associated with them, such as s.startswith(t) and s.islower() (see the String Comparison Operators table below).
  5. Strings can be split into lists: 'Monty Python'.split() gives ['Monty', 'Python'].
  6. Lists can be joined into strings: '/'.join(['Monty', 'Python']) gives 'Monty/Python'.
  7. A dictionary is used to map between arbitrary types of information, such as a string and a number:
    freq['cat'] = 12
  8. We create dictionaries using the brace notation:
    pos = {}
    pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}
  9. A tuple consists of a number of values separated by commas:
    t = 12345, 54321, 'hello!'
  10. Tuples are normally enclosed in parentheses ().
  11. List comprehensions provide a concise way to create lists, and the result is often clearer than a list built with an explicit for loop. Each list comprehension consists of an expression followed by a for clause, then zero or more for or if clauses. The result is a list obtained by evaluating the expression in the context of the for and if clauses that follow it. If the expression would evaluate to a tuple, it must be parenthesized.
    To get nouns from a tagged corpus: nouns = [w for (w, t) in tagged_text if t.startswith('N')]
  12. A string formatting expression template % arg_tuple consists of a format string template that contains conversion specifiers like %-6s and %0.2f.
    print("Tokens = %d; Types = %d; Tokens/Types = %0.2f" % (len(words), len(set(words)), len(words) / len(set(words))))

Iterating over sequences

Python Expression Comment
for item in s iterate over the items of s
for item in sorted(s) iterate over the items of s in order
for item in set(s) iterate over unique elements of s
for item in reversed(s) iterate over elements of s in reverse
for item in set(s).difference(t) iterate over elements of s not in t
for item in random.sample(s, len(s)) iterate over elements of s in random order (random.shuffle(s) shuffles in place and returns None, so it cannot be used here)
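
A short sketch of these iteration idioms; the lists s and t are made up:

    import random
    s, t = ['b', 'a', 'c', 'a'], ['c']
    for item in sorted(s):                  # a a b c
        print(item)
    for item in set(s).difference(t):       # a, b (in either order)
        print(item)
    for item in random.sample(s, len(s)):   # all of s, in random order
        print(item)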

List Methods

Operation Result
s[i] = x item i of s is replaced by x
s[i:j] = t slice of s from i to j is replaced by the contents of the iterable t
del s[i:j] same as s[i:j] = []
s[i:j:k] = t the elements of s[i:j:k] are replaced by those of t
del s[i:j:k] removes the elements of s[i:j:k] from the list
s.append(x) same as s[len(s):len(s)] = [x]
s.extend(x) same as s[len(s):len(s)] = x
s.count(x) return the number of indexes i for which s[i] == x
s.index(x[, i[, j]]) return smallest k such that s[k] == x and i <= k < j
s.insert(i, x) same as s[i:i] = [x]
s.pop([i]) same as x = s[i]; del s[i]; return x
s.remove(x) same as del s[s.index(x)]
s.reverse() reverses the items of s in place
s.sort(key=None, reverse=False) sort the items of s in place (Python 2 also accepted a cmp argument)

String Methods

Method Functionality
s.find(t) index of first instance of string t inside s (-1 if not found)
s.rfind(t) index of last instance of string t inside s (-1 if not found)
s.index(t) like s.find(t) except it raises ValueError if not found
s.rindex(t) like s.rfind(t) except it raises ValueError if not found
s.join(text) combine the words of the text into a string using s as the glue
s.split(t) split s into a list wherever a t is found (whitespace by default)
s.splitlines() split s into a list of strings, one per line
s.lower() a lowercased version of the string s
s.upper() an uppercased version of the string s
s.title() a titlecased version of the string s
s.strip() a copy of s without leading or trailing whitespace
s.replace(t, u) replace instances of t with u inside s
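
A few of these methods in action; the example string is made up:

    s = '  Monty Python  '.strip()             # 'Monty Python'
    print(s.lower())                           # monty python
    print(s.find('Python'))                    # 6
    print('/'.join(s.split()))                 # Monty/Python
    print(s.replace('Python', 'Anaconda'))     # Monty Anaconda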

String Comparison Operators

Function Meaning
s.startswith(t) test if s starts with t
s.endswith(t) test if s ends with t
t in s test if t is contained inside s
s.islower() test if all cased characters in s are lowercase
s.isupper() test if all cased characters in s are uppercase
s.isalpha() test if all characters in s are alphabetic
s.isalnum() test if all characters in s are alphanumeric
s.isdigit() test if all characters in s are digits
s.istitle() test if s is titlecased (all words in s have initial capitals)

Dictionary Methods

Example Description
d = {} create an empty dictionary and assign it to d
d[key] = value assign a value to a given dictionary key
d.keys() a view of the keys of the dictionary (a list in Python 2)
list(d) the list of keys of the dictionary
sorted(d) the keys of the dictionary, sorted
key in d test whether a particular key is in the dictionary
for key in d iterate over the keys of the dictionary
d.values() a view of the values in the dictionary (a list in Python 2)
dict([(k1,v1), (k2,v2), ...]) create a dictionary from a list of key-value pairs
d1.update(d2) add all items from d2 to d1
defaultdict(int) a dictionary whose default value is zero
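
For example, defaultdict(int) (imported from collections) gives a concise word counter; the sample sentence is made up:

    from collections import defaultdict
    counts = defaultdict(int)              # missing keys default to 0
    for w in 'the cat sat on the mat'.split():
        counts[w] += 1
    print(sorted(counts))                  # ['cat', 'mat', 'on', 'sat', 'the']
    print(counts['the'])                   # 2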

Frequency Distributions

NLTK's Frequency Distributions: commonly used methods and idioms for defining, accessing, and visualizing frequency distributions
Example Description
fdist = FreqDist(samples) create a frequency distribution containing the given samples
fdist[sample] += 1 increment the count for this sample
fdist['monstrous'] count of the number of times a given sample occurred
fdist.freq('monstrous') frequency of a given sample
fdist.N() total number of samples
fdist.most_common(n) the n most common samples and their frequencies
for sample in fdist: iterate over the samples
fdist.max() sample with the greatest count
fdist.tabulate() tabulate the frequency distribution
fdist.plot() graphical plot of the frequency distribution
fdist.plot(cumulative=True) cumulative plot of the frequency distribution
fdist1 |= fdist2 update fdist1 with counts from fdist2
fdist1 < fdist2 test if samples in fdist1 occur less frequently than in fdist2
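
A minimal sketch of these FreqDist idioms, assuming NLTK is installed; the token list is made up:

    import nltk
    words = 'the cat sat on the mat'.split()
    fdist = nltk.FreqDist(words)
    print(fdist['the'], fdist.N())     # 2 6
    print(fdist.max())                 # the
    print(fdist.most_common(2))        # [('the', 2), ('cat', 1)]
    print(fdist.freq('the'))           # 0.333...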

Conditional Frequency Distributions

NLTK's Conditional Frequency Distributions: commonly-used methods and idioms for defining, accessing, and visualizing a conditional frequency distribution of counters
Example Description
cfdist = ConditionalFreqDist(pairs) create a conditional frequency distribution from a list of pairs
cfdist.conditions() the conditions
cfdist[condition] the frequency distribution for this condition
cfdist[condition][sample] frequency for the given sample for this condition
cfdist.tabulate() tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions) tabulation limited to the specified samples and conditions
cfdist.plot() graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions) graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2 test if samples in cfdist1 occur less frequently than in cfdist2
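
A sketch counting word frequencies per genre in the Brown Corpus (requires nltk.download('brown')):

    import nltk
    from nltk.corpus import brown
    cfd = nltk.ConditionalFreqDist(
        (genre, word)
        for genre in ['news', 'romance']
        for word in brown.words(categories=genre))
    print(cfd.conditions())                        # ['news', 'romance']
    cfd.tabulate(conditions=['news', 'romance'],
                 samples=['the', 'love', 'could'])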

Control and Program Style

  1. We process each word in a text using a for statement, such as for w in t: or for word in text:. This must be followed by the colon character and an indented block of code, to be executed each time through the loop.
  2. We can also loop through things using a while loop.
    while n != 20:
  3. We can also do this in a single command (a list comprehension). To derive the vocabulary, collapsing case distinctions and ignoring punctuation, we can write set([w.lower() for w in text if w.isalpha()]).
  4. We test a condition using an if statement: if len(word) < 5:. This must be followed by the colon character and an indented block of code, to be executed only if the condition is true.
  5. A function is a block of code that has been assigned a name and can be reused. Functions are defined using the def keyword, as in def mult(x, y): where x and y are parameters of the function, and act as placeholders for actual data values.
  6. A function is called by specifying its name followed by one or more arguments inside parentheses, like this: mult(3, 4), e.g., len(text1).
  7. Python programs more than a few lines long should be entered using a text editor, saved to a file with a .py extension, and accessed using an import statement.
  8. Python functions permit you to associate a name with a particular block of code, and re-use that code as often as necessary.
  9. A function serves as a namespace: names defined inside a function are not visible outside that function, unless those names are declared to be global.
  10. Some functions are not available by default, but must be accessed using Python's import statement.
  11. To find out about some variable v, type help(v) in the Python interactive interpreter to read the help entry for this kind of object.
  12. Python's assignment and parameter passing use object references; e.g. if a is a list and we assign b = a, then both names refer to the same list object, so any in-place change made through a is visible through b, and vice versa.
  13. The is operator tests whether two references point to the very same object, while == tests whether two objects have equal values. This distinction parallels the type-token distinction.
  14. It is good to test functions with specialized test data (unit tests) to check that they work.
    To test +, we make some data and check each case:
        test = [(1, 1, 2), (1, 0, 1), (2, 2, 4)]
        for (a, b, c) in test:
            if a + b != c:
                print("test failed: %s + %s not equal to %s" % (a, b, c))
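
A short sketch of points 5-10: defining, calling, and importing; the function mult is the one named above:

    import math                          # point 10: import before use
    def mult(x, y):                      # point 5: x and y are parameters
        product = x * y                  # point 9: product is local to mult
        return product
    print(mult(3, 4))                    # point 6: call with arguments -> 12
    print(math.sqrt(mult(2, 8)))         # 4.0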

Lexical Resources

  1. A word "token" is a particular appearance of a given word in a text; a word "type" is the unique form of the word as a particular sequence of letters. We count word tokens using len(text) and word types using len(set(text)).
  2. We obtain the vocabulary of a text t using sorted(set(t)).
  3. We operate on each item of a text using [f(x) for x in text].
  4. A frequency distribution is a collection of items along with their frequency counts (e.g., the words of a text and their frequency of appearance).
  5. A text corpus is a large, structured collection of texts. NLTK comes with many corpora, e.g., the Brown Corpus, nltk.corpus.brown.
  6. Often you can access a corpus through several views, e.g. as raw text, a list of words, or a list of sentences (corpus.raw(), corpus.words(), corpus.sents()).
  7. Some text corpora are categorized, e.g., by genre or topic; sometimes the categories of a corpus overlap each other.
  8. A conditional frequency distribution is a collection of frequency distributions, each one for a different condition. They can be used for counting word frequencies, given a context or a genre.
  9. WordNet is a semantically-oriented dictionary of English, consisting of synonym sets or synsets and organized into a network.
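
A minimal WordNet sketch (requires nltk.download('wordnet')):

    from nltk.corpus import wordnet as wn
    for synset in wn.synsets('motorcar'):
        print(synset.name(), synset.lemma_names())
    # car.n.01 ['car', 'auto', 'automobile', 'machine', 'motorcar']
    print(wn.synset('car.n.01').hypernyms())    # [Synset('motor_vehicle.n.01')]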

Dealing with raw text

  1. In this book we view a text as a list of words. A "raw text" is a potentially long string containing words and whitespace formatting, and is how we typically store and visualize a text.
  2. We can read text from a file f using text = open(f).read().
  3. We can read text from a URL u using text = urlopen(u).read().
  4. We can write text to a file by opening the file for writing: file = open('output.txt', 'w'), then adding content to the file: file.write("Monty Python"), and finally closing the file: file.close().
  5. We can iterate over the lines of a text file using for line in open(f):.
  6. Texts found on the web may contain unwanted material (such as headers, footers, markup), that need to be removed before we do any linguistic processing.
  7. Tokenization is the segmentation of a text into basic units or tokens such as words and punctuation. Tokenization based on whitespace is inadequate for many applications because it bundles punctuation together with words. NLTK provides an off-the-shelf tokenizer nltk.word_tokenize().
  8. Lemmatization is a process that maps the various forms of a word (such as appeared, appears) to the canonical or citation form of the word, also known as the lexeme or lemma (e.g. appear).
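
A sketch of the whole pipeline, from URL to lemmas; the URL is hypothetical, and word_tokenize and the lemmatizer need nltk.download('punkt') and nltk.download('wordnet'):

    import nltk
    from urllib.request import urlopen
    raw = urlopen('http://example.com/text.txt').read().decode('utf8')
    tokens = nltk.word_tokenize(raw)                 # step 7: tokenization
    wnl = nltk.WordNetLemmatizer()                   # step 8: lemmatization
    print([wnl.lemmatize(t, 'v') for t in tokens])   # 'appeared' -> 'appear'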

Regular Expressions

  1. Regular expressions are a powerful and flexible method of specifying patterns. Once we have imported the re module, we can use re.findall() to find all substrings in a string that match a pattern.
  2. If a regular expression string includes a backslash, you should tell Python not to preprocess the string, by using a raw string with an r prefix: r'regexp'.
  3. Regular expressions use meta-characters: . ^ $ * + ? { [ ] \ | ( )
  4. When backslash is used before certain characters, e.g. \n, this takes on a special meaning (newline character); however, when backslash is used before regular expression wildcards and operators, e.g. \., \|, \$, these characters lose their special meaning and are matched literally.
  5. Regular expressions can be used to discover knowledge through patterns in text:
    「([non-ascii]+)」\(([ \w]+)\) finds translation pairs, e.g.
    「自然言語処理」(natural language processing)
    \b(\w+) such as (\w+) finds hypernym-hyponym pairs, e.g.
    animals such as cats, dogs and raccoons
  6. We can match a pattern and then inspect what was matched:
          m = re.match(pattern, string)
          if m:
              print('Match found:', m.group())
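
And re.findall() (point 1) applied to the hypernym pattern from point 5:

    import re
    text = 'animals such as cats, dogs and raccoons'
    print(re.findall(r'\b(\w+) such as (\w+)', text))   # [('animals', 'cats')]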

Regular Expression Summary

Operator Behavior
. Wildcard, matches any character
^abc Matches some pattern abc at the start of a string
abc$ Matches some pattern abc at the end of a string
[abc] Matches one of a set of characters
[A-Z0-9] Matches one of a range of characters
ed|ing|s Matches one of the specified strings (disjunction)
* Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure)
+ One or more of previous item, e.g. a+, [a-z]+
? Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?
{n} Exactly n repeats where n is a non-negative integer
{n,} At least n repeats
{,n} No more than n repeats
{m,n} At least m and no more than n repeats
a(b|c)+ Parentheses that indicate the scope of the operators
([A-Z][a-z]+)\1 Parentheses also mark a match group

Regular Expression Predefined Sequences

Symbol Function
\b Word boundary (zero width)
\d Any decimal digit (equivalent to [0-9])
\D Any non-digit character (equivalent to [^0-9])
\s Any whitespace character (equivalent to [ \t\n\r\f\v])
\S Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
\t The tab character
\n The newline character
\1 The first match group (\2 is the second, ...)

Tagging and Classification

  1. Words can be grouped into classes, such as nouns, verbs, adjectives, and adverbs. These classes are known as lexical categories or parts of speech. Parts of speech are assigned short labels, or tags, such as NN and VB.
  2. The process of automatically assigning parts of speech to words in text is called part-of-speech tagging, POS tagging, or just tagging.
  3. Automatic tagging is an important step in the NLP pipeline, and is useful in a variety of situations including: predicting the behavior of previously unseen words, analyzing word usage in corpora, and text-to-speech systems.
  4. Some linguistic corpora, such as the Brown Corpus, have been POS tagged.
  5. A variety of tagging methods are possible, e.g. default tagger, regular expression tagger, unigram tagger and n-gram taggers. These can be combined using a technique known as backoff (see the sketch after this list).
  6. Taggers can be trained and evaluated using tagged corpora. Useful features include the word itself, its prefix and suffix, and the tags of preceding words.
  7. Backoff is a method for combining models: when a more specialized model (such as a bigram tagger) cannot assign a tag in a given context, we backoff to a more general model (such as a unigram tagger).
  8. Part-of-speech tagging is an important, early example of a sequence classification task in NLP: a classification decision at any one point in the sequence makes use of words and tags in the local context.
  9. N-gram taggers can be defined for large values of n, but once n is larger than 3 we usually encounter the sparse data problem; even with a large quantity of training data we only see a tiny fraction of possible contexts.
  10. Modeling the linguistic data found in corpora can help us to understand linguistic patterns, and can be used to make predictions about new language data.
  11. Supervised classifiers use labeled training corpora to build models that predict the label of an input based on specific features of that input.
  12. Supervised classifiers can perform a wide variety of NLP tasks, including document classification, part-of-speech tagging, sentence segmentation, dialogue act type identification, and determining entailment relations, and many other tasks.
  13. Decision trees are automatically constructed tree-structured flowcharts that are used to assign labels to input values based on their features. Although they're easy to interpret, they are not very good at handling cases where feature values interact in determining the proper label.
  14. Most of the models that are automatically constructed from a corpus are descriptive: they let us know which features are relevant to a given pattern or construction, but they don't give any information about causal relationships between those features and patterns.
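
A sketch of a unigram tagger with a default-tagger backoff (points 5 and 7), trained and evaluated on the Brown Corpus (requires nltk.download('brown')):

    import nltk
    from nltk.corpus import brown
    tagged = brown.tagged_sents(categories='news')
    size = int(len(tagged) * 0.9)
    train, test = tagged[:size], tagged[size:]
    t0 = nltk.DefaultTagger('NN')                 # fallback: tag everything NN
    t1 = nltk.UnigramTagger(train, backoff=t0)    # backs off to t0 for unseen words
    print(t1.evaluate(test))                      # tagging accuracy on held-out data
                                                  # (newer NLTK: t1.accuracy(test))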

Empirical Linguistics and Language Engineering

  1. It is important to evaluate your results against (i) a baseline and if possible (ii) a gold standard.
  2. We normally have to trade off recall (coverage) against precision (accuracy):
    we can usually get more results or more accurate results, but not both at once.
  3. When training a supervised classifier, you should split your corpus into three datasets: a training set for building the classifier model; a dev-test set for helping select and tune the model's features; and a test set for evaluating the final model's performance.
  4. When evaluating a supervised classifier, it is important that you use fresh (unseen) data, that was not included in the training or dev-test set. Otherwise, your evaluation results may be unrealistically optimistic.
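
A sketch of the three-way split from point 3; tagged_sents is assumed to be a labeled corpus (e.g. brown.tagged_sents()), and the 10%/10%/80% proportions are illustrative:

    size = int(len(tagged_sents) * 0.1)
    test_set = tagged_sents[:size]                # final evaluation only
    dev_test_set = tagged_sents[size:2*size]      # feature selection and tuning
    train_set = tagged_sents[2*size:]             # model building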

That’s all folks


HG251: Language and the Computer Francis Bond, 2011.