For revision: reread chapters 1-6, redo the tutorial problems.
in.
list = ['Monty', 'Python']
list[0] = 'Monty'
list[0:1] = ['Monty']
len(list) = 2
string = "string"
string[0] = "s"
string[1:3] = "tr"
len(string) = 6
string.startswith('substr')
string.endswith('substr')
string.isalpha()
string.islower()
string.istitle()
string.isupper()
'Monty Python'.split() gives
['Monty', 'Python'].
'/'.join(['Monty', 'Python']) gives
'Monty/Python'.
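The indexing, slicing, and split/join results above can be checked directly in the interpreter. A minimal sketch follows; the names `words` and `s` are chosen here (instead of `list` and `string`) only to avoid shadowing Python built-ins.

```python
# Lists: indexing returns an item, slicing returns a (shorter) list.
words = ['Monty', 'Python']
first = words[0]            # 'Monty'
first_slice = words[0:1]    # ['Monty']
n_words = len(words)        # 2

# Strings behave the same way, character by character.
s = "string"
first_char = s[0]           # 's'
middle = s[1:3]             # 'tr' (index 1 up to, but not including, 3)
n_chars = len(s)            # 6

# split() turns a string into a list; join() glues a list back together.
parts = 'Monty Python'.split()
joined = '/'.join(parts)

print(first, first_slice, middle, n_chars, parts, joined)
```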
freq['cat'] = 12.
pos = {}
pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}
t = 12345, 54321, 'hello!'
()
nouns = [ w for (w, t) in tagged_text if t.startswith('N')]
print("Tokens = %d; Types = %d; Tokens/Type = %0.2f" %
      (len(words), len(set(words)), 1.0*len(words)/len(set(words))))
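As a worked instance of the token/type count, here is a small sketch on a made-up word list (the sentence is an assumption for illustration, not course data). Tokens are all the words; types are the distinct words.

```python
# A tiny, made-up sample text.
words = "the cat sat on the mat the end".split()

n_tokens = len(words)        # every word counts: 8
n_types = len(set(words))    # distinct words only: 6

# 1.0 * ... forces floating-point division, as in the notes.
ratio = 1.0 * n_tokens / n_types

summary = "Tokens = %d; Types = %d; Tokens/Type = %0.2f" % (n_tokens, n_types, ratio)
print(summary)
```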
| Python Expression | Comment |
|---|---|
| for item in s | iterate over the items of s |
| for item in sorted(s) | iterate over the items of s in order |
| for item in set(s) | iterate over unique elements of s |
| for item in reversed(s) | iterate over elements of s in reverse |
| for item in set(s).difference(t) | iterate over elements of s not in t |
| for item in random.sample(s, len(s)) | iterate over elements of s in random order |
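The iteration idioms in this table can be tried on a small list. The sketch below uses made-up data. Note that `random.shuffle` shuffles a list in place and returns `None`, so it is applied to a copy before iterating.

```python
import random

s = ['b', 'a', 'c', 'a']
t = ['c']

in_order = [item for item in sorted(s)]       # sorted copy; s is unchanged
unique = sorted(set(s))                       # a set has no order, so sort for display
backwards = [item for item in reversed(s)]    # last item first
not_in_t = sorted(set(s).difference(t))       # unique items of s that are not in t

# Random order: shuffle a copy in place, then iterate over the copy.
shuffled = list(s)
shuffle_result = random.shuffle(shuffled)     # returns None
for item in shuffled:
    print(item)
```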
| Operation | Result |
|---|---|
| s[i] = x | item i of s is replaced by x |
| s[i:j] = t | slice of s from i to j is replaced by the contents of the iterable t |
| del s[i:j] | same as s[i:j] = [] |
| s[i:j:k] = t | the elements of s[i:j:k] are replaced by those of t |
| del s[i:j:k] | removes the elements of s[i:j:k] from the list |
| s.append(x) | same as s[len(s):len(s)] = [x] |
| s.extend(x) | same as s[len(s):len(s)] = x |
| s.count(x) | return number of i's for which s[i] == x |
| s.index(x[, i[, j]]) | return smallest k such that s[k] == x and i <= k < j |
| s.insert(i, x) | same as s[i:i] = [x] |
| s.pop([i]) | same as x = s[i]; del s[i]; return x |
| s.remove(x) | same as del s[s.index(x)] |
| s.reverse() | reverses the items of s in place |
| s.sort([cmp[, key[, reverse]]]) | sort the items of s in place |
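Several rows of this table say that a method is "the same as" a slice operation. The sketch below checks a few of these equivalences on made-up lists; it is an illustration, not an exhaustive test.

```python
# Slice assignment and deletion change the list in place.
s = [1, 2, 3, 4, 5]
s[1:3] = ['a', 'b', 'c']          # replace items 1..2 with three new items
after_assign = list(s)            # [1, 'a', 'b', 'c', 4, 5]
del s[1:4]                        # same effect as s[1:4] = []
after_del = list(s)               # [1, 4, 5]

# append(x) behaves like s[len(s):len(s)] = [x]
appended = [1, 4, 5]
appended.append(6)
spliced = [1, 4, 5]
spliced[len(spliced):len(spliced)] = [6]

# insert(i, x) behaves like s[i:i] = [x]
inserted = [1, 4, 5]
inserted.insert(1, 'x')
sliced_in = [1, 4, 5]
sliced_in[1:1] = ['x']

# pop() removes and returns the last item.
stack = [1, 4, 5]
last = stack.pop()

print(after_assign, after_del, appended, inserted, last, stack)
```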
| Method | Functionality |
|---|---|
| s.find(t) | index of first instance of string t inside s (-1 if not found) |
| s.rfind(t) | index of last instance of string t inside s (-1 if not found) |
| s.index(t) | like s.find(t) except it raises ValueError if not found |
| s.rindex(t) | like s.rfind(t) except it raises ValueError if not found |
| s.join(text) | combine the words of the text into a string using s as the glue |
| s.split(t) | split s into a list wherever a t is found (whitespace by default) |
| s.splitlines() | split s into a list of strings, one per line |
| s.lower() | a lowercased version of the string s |
| s.upper() | an uppercased version of the string s |
| s.title() | a titlecased version of the string s |
| s.strip() | a copy of s without leading or trailing whitespace |
| s.replace(t, u) | replace instances of t with u inside s |
| Function | Meaning |
|---|---|
| s.startswith(t) | test if s starts with t |
| s.endswith(t) | test if s ends with t |
| t in s | test if t is contained inside s |
| s.islower() | test if all cased characters in s are lowercase |
| s.isupper() | test if all cased characters in s are uppercase |
| s.isalpha() | test if all characters in s are alphabetic |
| s.isalnum() | test if all characters in s are alphanumeric |
| s.isdigit() | test if all characters in s are digits |
| s.istitle() | test if s is titlecased (all words in s have initial capitals) |
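The string methods and tests in the two tables above can be exercised on one example string. A minimal sketch, using a sample phrase chosen for illustration:

```python
s = "Colorless green ideas"

pos_green = s.find("green")        # index where "green" starts
pos_missing = s.find("blue")       # -1: find() signals failure with a value
last_e = s.rfind("e")              # index of the last "e"

# index() is like find() but raises ValueError when nothing is found.
try:
    s.index("blue")
    index_raised = False
except ValueError:
    index_raised = True

titled = s.title()
words = s.lower().split()          # whitespace split by default
glued = "-".join(words)            # "-" is the glue between the words
stripped = "  padded \n".strip()
swapped = s.replace("green", "blue")

checks = (s.startswith("Color"), s.endswith("ideas"), "green" in s,
          s.istitle(), titled.istitle(),
          "abc123".isalnum(), "abc123".isalpha(), "2011".isdigit())
print(pos_green, pos_missing, last_e, titled, glued, checks)
```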
| Example | Description |
|---|---|
| d = {} | create an empty dictionary and assign it to d |
| d[key] = value | assign a value to a given dictionary key |
| d.keys() | the list of keys of the dictionary |
| list(d) | the list of keys of the dictionary |
| sorted(d) | the keys of the dictionary, sorted |
| key in d | test whether a particular key is in the dictionary |
| for key in d | iterate over the keys of the dictionary |
| d.values() | the list of values in the dictionary |
| dict([(k1,v1), (k2,v2), ...]) | create a dictionary from a list of key-value pairs |
| d1.update(d2) | add all items from d2 to d1 |
| defaultdict(int) | a dictionary whose default value is zero |
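The dictionary operations above, sketched on the part-of-speech example from earlier in these notes. The extra entries and the sample sentence are made up for illustration.

```python
from collections import defaultdict

pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}
pos['sleep'] = 'v'                     # assign a value to a new key
pos.update({'green': 'adj'})           # add all items from another dictionary

keys_sorted = sorted(pos)              # the keys, in alphabetical order
has_ideas = 'ideas' in pos             # membership tests look at keys
has_noun = 'n' in pos                  # ... not at values

from_pairs = dict([('cat', 'n'), ('run', 'v')])

# defaultdict(int): unseen keys start at zero, so counting needs no setup.
counts = defaultdict(int)
for w in "the cat saw the other cat".split():
    counts[w] += 1

print(keys_sorted, has_ideas, from_pairs, dict(counts))
```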
| Example | Description |
|---|---|
| fdist = FreqDist(samples) | create a frequency distribution containing the given samples |
| fdist[sample] += 1 | increment the count for this sample |
| fdist['monstrous'] | count of the number of times a given sample occurred |
| fdist.freq('monstrous') | frequency of a given sample |
| fdist.N() | total number of samples |
| fdist.most_common(n) | the n most common samples and their frequencies |
| for sample in fdist: | iterate over the samples |
| fdist.max() | sample with the greatest count |
| fdist.tabulate() | tabulate the frequency distribution |
| fdist.plot() | graphical plot of the frequency distribution |
| fdist.plot(cumulative=True) | cumulative plot of the frequency distribution |
| fdist1 \|= fdist2 | update fdist1 with counts from fdist2 |
| fdist1 < fdist2 | test if samples in fdist1 occur less frequently than in fdist2 |
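`FreqDist` needs NLTK installed. As a stand-in that runs with the standard library alone, the sketch below uses `collections.Counter`, which NLTK 3's `FreqDist` is built on. The sample sentence is made up, and the variables `total`, `rel_freq`, and `best` are hand-written approximations of `fdist.N()`, `fdist.freq()`, and `fdist.max()`, not NLTK calls.

```python
from collections import Counter

samples = "the cat saw the other cat near the mat".split()

fdist = Counter(samples)          # stand-in for FreqDist(samples)
fdist['dog'] += 1                 # increment the count for a sample

the_count = fdist['the']          # how many times 'the' occurred
total = sum(fdist.values())       # roughly what fdist.N() reports
rel_freq = 1.0 * fdist['the'] / total     # roughly fdist.freq('the')
top2 = fdist.most_common(2)       # the 2 most common samples with counts
best = fdist.most_common(1)[0][0] # roughly fdist.max()

print(the_count, total, rel_freq, top2, best)
```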
| Example | Description |
|---|---|
| cfdist = ConditionalFreqDist(pairs) | create a conditional frequency distribution from a list of pairs |
| cfdist.conditions() | the conditions |
| cfdist[condition] | the frequency distribution for this condition |
| cfdist[condition][sample] | frequency for the given sample for this condition |
| cfdist.tabulate() | tabulate the conditional frequency distribution |
| cfdist.tabulate(samples, conditions) | tabulation limited to the specified samples and conditions |
| cfdist.plot() | graphical plot of the conditional frequency distribution |
| cfdist.plot(samples, conditions) | graphical plot limited to the specified samples and conditions |
| cfdist1 < cfdist2 | test if samples in cfdist1 occur less frequently than in cfdist2 |
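The idea behind a conditional frequency distribution (one frequency distribution per condition) can be sketched without NLTK as a dictionary of counters. This is an illustrative stand-in, not the `ConditionalFreqDist` class itself; the (condition, sample) pairs are made up.

```python
from collections import Counter, defaultdict

# Each pair is (condition, sample), e.g. (genre, word).
pairs = [('news', 'the'), ('news', 'cat'),
         ('romance', 'the'), ('romance', 'love'), ('romance', 'love')]

# One Counter per condition, created on first use.
cfdist = defaultdict(Counter)
for condition, sample in pairs:
    cfdist[condition][sample] += 1

conditions = sorted(cfdist)                 # the conditions seen
romance_love = cfdist['romance']['love']    # count for a sample under a condition
news_love = cfdist['news']['love']          # unseen samples count as 0

print(conditions, romance_love, news_love)
```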
for w in t: or for word in text:. This must be
followed by the colon character and an indented block of code, to be
executed each time through the loop.
while (n != 20):
set([w.lower() for w in text if w.isalpha()]).
if len(word) < 5:. This must be followed by the colon character and an indented
block of code, to be executed only if the condition is true.
def keyword,
as in def mult(x, y); x and y are parameters of the function,
and act as placeholders for actual data values.
mult(3, 4),
e.g., len(text1).
.py extension,
and accessed using an import statement.
statement.
Type help(v) in the
Python interactive interpreter to read the help entry for this kind
of object.
If a is a list and we assign b = a, then any
operation on a will modify b, and vice versa.
The is operation tests if two objects are identical internal
objects, while == tests if two objects are equivalent. This
distinction parallels the type-token distinction.
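Both points, shared references after `b = a` and the difference between `is` and `==`, can be seen in a few lines. A minimal sketch with made-up list contents:

```python
a = ['Monty', 'Python']
b = a                    # b is another name for the SAME list object
b.append('Holy')         # so this change is visible through a as well
a_after = list(a)

c = list(a)              # c is a new list with the same contents
same_object = a is b     # identical objects (same token)
equal_copy = a == c      # equivalent contents (same type)
copy_is_same = a is c    # but not the same object

c.append('Grail')        # changing the copy leaves a untouched
print(a, b, c, same_object, equal_copy, copy_is_same)
```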
+ we make some data
test = [(1, 1, 2), (1, 0, 1), (2, 2, 4)]
for (a, b, c) in test:
    if a + b != c:
        print("test failed: %s + %s not equal to %s" % (a, b, c))
len(text) and word types using len(set(text)).
sorted(set(t)).
[f(x) for x in text].
nltk.corpus.brown.
nltk.corpus.brown.raw the whole corpus as one string
nltk.corpus.brown.words the whole corpus tokenized into words
nltk.corpus.brown.sents the whole corpus tokenized into words and split into sentences
nltk.corpus.brown.tagged_words the whole corpus tokenized and tagged
nltk.corpus.brown.tagged_sents the whole corpus tokenized and tagged and split into sents
f using text = open(f).read().
u using text = urlopen(u).read().
file = open('output.txt', 'w'), then adding content to the file:
file.write("Monty Python"), and finally closing the file:
file.close().
for line in open(f):.
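The write, read, and line-by-line steps fit together as below. This is a sketch under assumptions: it writes to a temporary directory (so no existing `output.txt` is touched), and the name `out` replaces `file` only to avoid a name that is a built-in in older Python versions.

```python
import os
import tempfile

# Use a scratch directory so the example cannot overwrite a real file.
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

# Open for writing, add content, then close.
out = open(path, 'w')
out.write("Monty Python\n")
out.write("Holy Grail\n")
out.close()

# Read the whole file as one string.
text = open(path).read()

# Or process it one line at a time; strip() removes the trailing newline.
lines = []
for line in open(path):
    lines.append(line.strip())

print(repr(text), lines)
```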
nltk.word_tokenize().
appear).
re.findall() to find all substrings in a string that match a
pattern.
r prefix: r'regexp'.
. ^ $ * + ? { [ ] \ | ( )
\n, this takes on a special meaning (newline character);
however, when backslash is used before regular expression wildcards
and operators, e.g. \., \|, \$, these
characters lose their special meaning and are matched literally.
「([non-ascii]+)」\(([ \w]+)\) finds translation pairs
\b(\w+) such as (\w+) finds hypernym-hyponyms
m = re.match(pattern, string)
if m:
    print('Match found: %s' % m.group())
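A small sketch contrasting `re.match`, `re.search`, and `re.findall` on made-up strings. `re.match` only succeeds at the start of the string and returns `None` otherwise, which is why the result is tested with `if m:` before calling `m.group()`.

```python
import re

text = 'Monty Python'

# match() anchors at the beginning of the string.
m = re.match(r'\w+', text)
found = m.group() if m else None          # the first word

no_match = re.match(r'Python', text)      # None: 'Python' is not at the start

# search() scans the whole string for the first match.
searched = re.search(r'Python', text).group()

# findall() returns every non-overlapping match as a list of strings.
ing_words = re.findall(r'\b\w+ing\b', 'sing a song while walking')

print(found, no_match, searched, ing_words)
```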
| Operator | Behavior |
|---|---|
| . | Wildcard, matches any character |
| ^abc | Matches some pattern abc at the start of a string |
| abc$ | Matches some pattern abc at the end of a string |
| [abc] | Matches one of a set of characters |
| [A-Z0-9] | Matches one of a range of characters |
| \|ing\|s | Matches one of the specified strings (disjunction) |
| * | Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure) |
| + | One or more of previous item, e.g. a+, [a-z]+ |
| ? | Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]? |
| {n} | Exactly n repeats where n is a non-negative integer |
| {n,} | At least n repeats |
| {,n} | No more than n repeats |
| {m,n} | At least m and no more than n repeats |
| a(b\|c)+ | Parentheses that indicate the scope of the operators |
| ([A-Z][a-z]+)\1 | Parentheses also mark a match group |
| Symbol | Function |
|---|---|
| \b | Word boundary (zero width) |
| \d | Any decimal digit (equivalent to [0-9]) |
| \D | Any non-digit character (equivalent to [^0-9]) |
| \s | Any whitespace character (equivalent to [ \t\n\r\f\v]) |
| \S | Any non-whitespace character (equivalent to [^ \t\n\r\f\v]) |
| \w | Any alphanumeric character (equivalent to [a-zA-Z0-9_]) |
| \W | Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_]) |
| \t | The tab character |
| \n | The newline character |
| \1 | The first match group (\2 is the second, ...) |
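A few of the operators and symbols from the two tables, tried with `re.findall`, `re.search`, and `re.split`. The test strings are made up for illustration. Raw strings (the `r` prefix) keep Python from interpreting the backslashes first.

```python
import re

caps = re.findall(r'[A-Z][a-z]+', 'Monty Python')          # ranges and +
digits = re.findall(r'\d+', 'HG251 in 2011')               # \d: decimal digits
optional = re.findall(r'colou?r', 'color colour')          # ?: optional item
short = re.findall(r'\b\w{3,4}\b', 'a cat sat on mats today')  # {m,n} repeats

# Backslash makes $ and . literal instead of special.
price = re.findall(r'\$\d+\.\d\d', 'cost $3.50 or 4.75')

# Parentheses limit the scope of + ...
scoped = re.match(r'a(b|c)+$', 'abcb') is not None

# ... and also mark a group; \1 must repeat what group 1 matched.
doubled = re.search(r'([A-Z][a-z]+)\1', 'BoraBora').group(1)

# With groups, findall returns tuples of the grouped parts.
pairs = re.findall(r'(\w+) such as (\w+)', 'fruits such as apples')

# \s+ splits on any run of whitespace.
fields = re.split(r'\s+', 'a  b\tc\nd')

print(caps, digits, optional, short, price, scoped, doubled, pairs, fields)
```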
HG251: Language and the Computer Francis Bond, 2011.