For revision: reread chapters 1-6, redo the tutorial problems.
in
.
list = ['Monty', 'Python']
list[0] = 'Monty'
list[0:1] = ['Monty']
len(list) = 2
string = "string"
string[0] = "s"
string[1:3] = "tr"
len(string) = 6
string.startswith('substr')
string.endswith('substr')
string.isalpha()
string.islower()
string.istitle()
string.isupper()
'Monty Python'.split()
gives
['Monty', 'Python']
.
'/'.join(['Monty', 'Python'])
gives
'Monty/Python'
.
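The indexing, slicing, split and join facts above can be checked in a few lines. This is a minimal sketch; the variable names words and text are chosen here (rather than list and string) only so that the built-in list is not shadowed.

```python
# Lists and strings share indexing, slicing and len().
words = ['Monty', 'Python']
first = words[0]        # item at position 0
head = words[0:1]       # a slice of a list is itself a list

text = "string"
middle = text[1:3]      # characters at positions 1 and 2 (end index excluded)
length = len(text)      # number of characters

# split() breaks a string on whitespace; join() glues a list back together.
tokens = 'Monty Python'.split()
joined = '/'.join(tokens)

print(first, head, middle, length, tokens, joined)
```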
freq['cat'] = 12
.
pos = {}
pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}
t = 12345, 54321, 'hello!'
()
nouns = [ w for (w, t) in tagged_text if t.startswith('N')]
print("Tokens = %d; Types = %d; Tokens/Type = %0.2f" %
      (len(words), len(set(words)), 1.0*len(words)/len(set(words))))
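As a sketch of the list comprehension and the token/type count above, the following uses a small made-up tagged_text (the words and tags are illustrative only, not taken from any corpus).

```python
# Hypothetical tagged text: a list of (word, tag) pairs.
tagged_text = [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD'),
               ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]

# Keep the words whose tag starts with 'N' (the nouns).
nouns = [w for (w, t) in tagged_text if t.startswith('N')]

# Tokens are all word occurrences; types are the distinct words.
words = [w for (w, t) in tagged_text]
n_tokens = len(words)
n_types = len(set(words))
summary = "Tokens = %d; Types = %d; Tokens/Type = %0.2f" % (
    n_tokens, n_types, n_tokens / n_types)
print(nouns)
print(summary)
```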
Python Expression | Comment |
---|---|
for item in s | iterate over the items of s |
for item in sorted(s) | iterate over the items of s in order |
for item in set(s) | iterate over unique elements of s |
for item in reversed(s) | iterate over elements of s in reverse |
for item in set(s).difference(t) | iterate over elements of s not in t |
for item in random.sample(s, len(s)) | iterate over elements of s in random order |
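The iteration patterns in this table can be tried on a small list. Note that random.shuffle(s) shuffles s in place and returns None, so it cannot be looped over directly; this sketch uses random.sample(s, len(s)) instead, which returns a new list in random order. The lists s and t are made up for illustration.

```python
import random

s = ['b', 'a', 'c', 'a']
t = ['a']

in_order = [item for item in sorted(s)]     # items of s in sorted order
unique = sorted(set(s))                     # unique items (sorted here for a stable result)
backwards = [item for item in reversed(s)]  # items of s in reverse
not_in_t = sorted(set(s).difference(t))     # items of s that are not in t
shuffled = random.sample(s, len(s))         # new list in random order; s is unchanged

print(in_order, unique, backwards, not_in_t, shuffled)
```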
Operation | Result |
---|---|
s[i] = x | item i of s is replaced by x |
s[i:j] = t | slice of s from i to j is replaced by the contents of the iterable t |
del s[i:j] | same as s[i:j] = [] |
s[i:j:k] = t | the elements of s[i:j:k] are replaced by those of t |
del s[i:j:k] | removes the elements of s[i:j:k] from the list |
s.append(x) | same as s[len(s):len(s)] = [x] |
s.extend(x) | same as s[len(s):len(s)] = x |
s.count(x) | return number of i's for which s[i] == x |
s.index(x[, i[, j]]) | return smallest k such that s[k] == x and i <= k < j |
s.insert(i, x) | same as s[i:i] = [x] |
s.pop([i]) | same as x = s[i]; del s[i]; return x |
s.remove(x) | same as del s[s.index(x)] |
s.reverse() | reverses the items of s in place |
s.sort(key=None, reverse=False) | sort the items of s in place (the cmp argument exists only in Python 2) |
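The list operations in this table can be traced step by step. The sketch below applies them in sequence to one small list; the comments show the list after each step.

```python
s = [3, 1, 2]
s[0] = 5              # [5, 1, 2]
s[1:2] = [7, 8]       # [5, 7, 8, 2]
del s[0:1]            # [7, 8, 2]
s.append(9)           # [7, 8, 2, 9]
s.extend([1, 8])      # [7, 8, 2, 9, 1, 8]
count_8 = s.count(8)  # how many items equal 8
first_8 = s.index(8)  # position of the first 8
s.insert(0, 4)        # [4, 7, 8, 2, 9, 1, 8]
last = s.pop()        # removes and returns the last item
s.remove(7)           # [4, 8, 2, 9, 1]
s.reverse()           # [1, 9, 2, 8, 4]
s.sort()              # [1, 2, 4, 8, 9]
print(s, count_8, first_8, last)
```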
Method | Functionality |
---|---|
s.find(t) | index of first instance of string t inside s (-1 if not found) |
s.rfind(t) | index of last instance of string t inside s (-1 if not found) |
s.index(t) | like s.find(t) except it raises ValueError if not found |
s.rindex(t) | like s.rfind(t) except it raises ValueError if not found |
s.join(text) | combine the words of the text into a string using s as the glue |
s.split(t) | split s into a list wherever a t is found (whitespace by default) |
s.splitlines() | split s into a list of strings, one per line |
s.lower() | a lowercased version of the string s |
s.upper() | an uppercased version of the string s |
s.title() | a titlecased version of the string s |
s.strip() | a copy of s without leading or trailing whitespace |
s.replace(t, u) | replace instances of t with u inside s |
Function | Meaning |
---|---|
s.startswith(t) | test if s starts with t |
s.endswith(t) | test if s ends with t |
t in s | test if t is contained inside s |
s.islower() | test if all cased characters in s are lowercase |
s.isupper() | test if all cased characters in s are uppercase |
s.isalpha() | test if all characters in s are alphabetic |
s.isalnum() | test if all characters in s are alphanumeric |
s.isdigit() | test if all characters in s are digits |
s.istitle() | test if s is titlecased (all words in s have initial capitals) |
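A few of the string methods and tests from the two tables above, applied to a made-up example string. This is only a sketch; the string itself is arbitrary.

```python
s = "  Colorless green ideas  "
stripped = s.strip()                         # no leading or trailing whitespace

found = stripped.find('green')               # index of the first match
missing = stripped.find('blue')              # -1 when there is no match
title = stripped.title()                     # every word capitalised
replaced = stripped.replace('green', 'red')
glued = '-'.join(stripped.split())           # split on whitespace, re-join with '-'

# index() behaves like find() but raises ValueError when nothing is found.
try:
    stripped.index('blue')
    raised = False
except ValueError:
    raised = True

tests = (stripped.startswith('Color'), 'green' in stripped,
         stripped.istitle(), title.istitle(),
         'abc123'.isalnum(), 'abc123'.isalpha(), '2011'.isdigit())
print(found, missing, title, replaced, glued, raised, tests)
```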
Example | Description |
---|---|
d = {} | create an empty dictionary and assign it to d |
d[key] = value | assign a value to a given dictionary key |
d.keys() | the list of keys of the dictionary |
list(d) | the list of keys of the dictionary |
sorted(d) | the keys of the dictionary, sorted |
key in d | test whether a particular key is in the dictionary |
for key in d | iterate over the keys of the dictionary |
d.values() | the list of values in the dictionary |
dict([(k1,v1), (k2,v2), ...]) | create a dictionary from a list of key-value pairs |
d1.update(d2) | add all items from d2 to d1 |
defaultdict(int) | a dictionary whose default value is zero |
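The dictionary operations in this table, sketched on the part-of-speech example used earlier in these notes. The extra entries ('sleep', 'green') and the word list being counted are made up for illustration.

```python
from collections import defaultdict

pos = {}                              # empty dictionary
pos['colorless'] = 'adj'              # assign a value to a key
pos['ideas'] = 'n'
pos.update({'furiously': 'adv'})      # add all items from another dictionary

keys = sorted(pos)                    # the keys, sorted
has_ideas = 'ideas' in pos            # membership test on keys
pairs = dict([('sleep', 'v'), ('green', 'adj')])   # from key-value pairs

# defaultdict(int) supplies 0 for unseen keys, which makes counting easy.
counts = defaultdict(int)
for word in ['cat', 'dog', 'cat']:
    counts[word] += 1
print(keys, has_ideas, pairs, dict(counts))
```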
Example | Description |
---|---|
fdist = FreqDist(samples) | create a frequency distribution containing the given samples |
fdist[sample] += 1 | increment the count for this sample |
fdist['monstrous'] | count of the number of times a given sample occurred |
fdist.freq('monstrous') | frequency of a given sample |
fdist.N() | total number of samples |
fdist.most_common(n) | the n most common samples and their frequencies |
for sample in fdist: | iterate over the samples |
fdist.max() | sample with the greatest count |
fdist.tabulate() | tabulate the frequency distribution |
fdist.plot() | graphical plot of the frequency distribution |
fdist.plot(cumulative=True) | cumulative plot of the frequency distribution |
fdist1 |= fdist2 | update fdist1 with counts from fdist2 |
fdist1 < fdist2 | test if samples in fdist1 occur less frequently than in fdist2 |
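To see what the counting operations in this table compute without needing NLTK installed, the sketch below uses collections.Counter from the standard library as a stand-in. In NLTK 3, FreqDist is a subclass of Counter, so item access and most_common() work the same way; the equivalents of N(), freq() and max() are spelled out by hand here and are an illustration, not NLTK's own code.

```python
from collections import Counter

samples = ['a', 'b', 'a', 'c', 'a', 'b']
fdist = Counter(samples)          # stand-in for FreqDist(samples)
fdist['a'] += 1                   # increment the count for a sample

count_a = fdist['a']              # like fdist['a']
total = sum(fdist.values())       # like fdist.N()
freq_a = count_a / total          # like fdist.freq('a')
top2 = fdist.most_common(2)       # the 2 most common samples with counts
best = max(fdist, key=fdist.get)  # like fdist.max()
print(count_a, total, freq_a, top2, best)
```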
Example | Description |
---|---|
cfdist = ConditionalFreqDist(pairs) | create a conditional frequency distribution from a list of pairs |
cfdist.conditions() | the conditions |
cfdist[condition] | the frequency distribution for this condition |
cfdist[condition][sample] | frequency for the given sample for this condition |
cfdist.tabulate() | tabulate the conditional frequency distribution |
cfdist.tabulate(samples, conditions) | tabulation limited to the specified samples and conditions |
cfdist.plot() | graphical plot of the conditional frequency distribution |
cfdist.plot(samples, conditions) | graphical plot limited to the specified samples and conditions |
cfdist1 < cfdist2 | test if samples in cfdist1 occur less frequently than in cfdist2 |
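A conditional frequency distribution can be thought of as one frequency distribution per condition. The sketch below imitates that idea with a defaultdict of Counter objects from the standard library; it is a stand-in for illustration, not NLTK's ConditionalFreqDist, and the (condition, sample) pairs are made up.

```python
from collections import Counter, defaultdict

# Hypothetical (condition, sample) pairs, e.g. (genre, word).
pairs = [('news', 'the'), ('news', 'said'), ('romance', 'the'),
         ('romance', 'love'), ('news', 'the')]

cfdist = defaultdict(Counter)
for condition, sample in pairs:
    cfdist[condition][sample] += 1   # count the sample under its condition

conditions = sorted(cfdist)          # like cfdist.conditions()
news = cfdist['news']                # the distribution for one condition
news_the = cfdist['news']['the']     # count of a sample under a condition
print(conditions, dict(news), news_the)
```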
Process each item in a string or list using a for statement, e.g. for w in t: or for word in text:. The statement must end with the colon character and be followed by an indented block of code, to be executed each time through the loop.
while (n != 20):
set([w.lower() for w in text if w.isalpha()])
.
Test a condition using an if statement, e.g. if len(word) < 5:. This must be followed by the colon character and an indented block of code, to be executed only if the condition is true.
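The for, while and if fragments above fit together as follows. This is a minimal sketch on a made-up word list; the step size in the while loop is arbitrary.

```python
text = ['The', 'quick', 'brown', 'fox', '...', 'jumps']

# for + if: the indented block runs once per word, and the inner
# block runs only when the condition is true.
short_words = []
for word in text:
    if len(word) < 5:
        short_words.append(word)

# while: repeat the indented block until the condition becomes false.
n = 0
while n != 20:
    n += 5

# A list comprehension with a condition, wrapped in set() to remove duplicates.
vocab = set([w.lower() for w in text if w.isalpha()])
print(short_words, n, sorted(vocab))
```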
Functions are defined using the def keyword, as in def mult(x, y); x and y are parameters of the function, and act as placeholders for actual data values.
A function is called by giving its name followed by its arguments in parentheses, like mult(3, 4), e.g., len(text1).
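Defining and calling a function looks like this. The body of mult is an assumption (the notes only give its signature), and text1 is a short made-up list standing in for a real text.

```python
def mult(x, y):
    """Return the product of x and y (x and y are the parameters)."""
    return x * y

product = mult(3, 4)                  # 3 and 4 are the arguments

text1 = ['Call', 'me', 'Ishmael']     # hypothetical stand-in for a text
length = len(text1)                   # calling a built-in function
print(product, length)
```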
Modules are saved in files with a .py extension, and accessed using an import statement.
To find out about a variable v, type help(v) in the Python interactive interpreter to read the help entry for this kind of object.
If a is a list and we assign b = a, then both names refer to the same list: any operation that modifies a will also modify b, and vice versa.
The is operation tests if two objects are identical internal objects, while == tests if two objects are equivalent. This distinction parallels the type-token distinction.
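The difference between sharing a list and copying it, and between is and ==, shows up in a short experiment. This is a sketch; the list contents are arbitrary.

```python
a = ['Monty', 'Python']
b = a               # b is another name for the same list object
c = list(a)         # c is a separate copy with equal contents

same_object = b is a          # identical internal objects
copy_equal = c == a           # equivalent contents
copy_identical = c is a       # but not the same object

a.append('Flying')            # modify the list through the name a
b_after = list(b)             # b sees the change
c_after = list(c)             # the copy does not
print(same_object, copy_equal, copy_identical, b_after, c_after)
```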
To test +, we make some data:
test = [(1, 1, 2), (1, 0, 1), (2, 2, 4)]
for (a, b, c) in test:
    if a + b != c:
        print("test failed: %s + %s not equal to %s" % (a, b, c))
Count word tokens using len(text) and word types using len(set(text)).
sorted(set(t))
.
[f(x) for x in text]
.
nltk.corpus.brown
.
nltk.corpus.brown.raw(): the whole corpus as one string
nltk.corpus.brown.words(): the whole corpus tokenized into words
nltk.corpus.brown.sents(): the whole corpus tokenized into words and split into sentences
nltk.corpus.brown.tagged_words(): the whole corpus tokenized and tagged
nltk.corpus.brown.tagged_sents(): the whole corpus tokenized and tagged and split into sentences
Read text from a file f using text = open(f).read().
Read text from a URL u using text = urlopen(u).read().
Write text to a file by opening the file for writing, file = open('output.txt', 'w'), then adding content to the file: file.write("Monty Python"), and finally closing the file: file.close().
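The file operations above can be sketched end to end. To avoid overwriting anything, this example writes to a temporary directory (an assumption made here; the notes write to output.txt in the current directory) and uses with blocks, which close the file automatically.

```python
import os
import tempfile

# Hypothetical location: a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

# Open for writing, add content; the with block closes the file.
with open(path, 'w') as out:
    out.write("Monty Python\n")
    out.write("Flying Circus\n")

# Read the whole file as one string.
with open(path) as f:
    text = f.read()

# Iterate over the file one line at a time.
with open(path) as f:
    lines = [line.strip() for line in f]
print(repr(text), lines)
```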
for line in open(f):
.
nltk.word_tokenize()
.
appear
).
Use re.findall() to find all substrings in a string that match a pattern.
Write regular expressions as raw strings, using the r prefix: r'regexp'.
. ^ $ * + ? { [ ] \ | ( )
When backslash is used before certain characters, e.g. \n, this takes on a special meaning (newline character); however, when backslash is used before regular expression wildcards and operators, e.g. \., \|, \$, these characters lose their special meaning and are matched literally.
「([non-ascii]+)」\(([ \w]+)\) finds translation pairs
\b(\w+) such as (\w+) finds hypernym-hyponym pairs
m = re.match(pattern, string)
if m:
    print('Match found: ', m.group())
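The matching functions and patterns above can be tried on short strings. This is a sketch; the example sentences are made up.

```python
import re

# re.match() tests for a match at the start of the string.
m = re.match(r'\w+', 'Monty Python')
first_word = m.group() if m else None

# re.findall() returns every non-overlapping match.
vowels = re.findall(r'[aeiou]', 'python regexp')

# With two groups in the pattern, findall() returns a list of pairs.
pairs = re.findall(r'\b(\w+) such as (\w+)',
                   'He likes fruit such as apples and animals such as dogs.')

# Backslash before . and $ makes them match literally.
prices = re.findall(r'\$\d+\.\d+', 'It costs $3.50 or $12.00')
print(first_word, vowels, pairs, prices)
```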
Operator | Behavior |
---|---|
. | Wildcard, matches any character |
^abc | Matches some pattern abc at the start of a string |
abc$ | Matches some pattern abc at the end of a string |
[abc] | Matches one of a set of characters |
[A-Z0-9] | Matches one of a range of characters |
ed|ing|s | Matches one of the specified strings (disjunction) |
* | Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure) |
+ | One or more of previous item, e.g. a+, [a-z]+ |
? | Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]? |
{n} | Exactly n repeats where n is a non-negative integer |
{n,} | At least n repeats |
{,n} | No more than n repeats |
{m,n} | At least m and no more than n repeats |
a(b|c)+ | Parentheses that indicate the scope of the operators |
([A-Z][a-z]+)\1 | Parentheses also mark a match group |
Symbol | Function |
---|---|
\b | Word boundary (zero width) |
\d | Any decimal digit (equivalent to [0-9]) |
\D | Any non-digit character (equivalent to [^0-9]) |
\s | Any whitespace character (equivalent to [ \t\n\r\f\v]) |
\S | Any non-whitespace character (equivalent to [^ \t\n\r\f\v]) |
\w | Any alphanumeric character (equivalent to [a-zA-Z0-9_]) |
\W | Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_]) |
\t | The tab character |
\n | The newline character |
\1 | The first match group (\2 is the second, ...) |
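Several of the operators and symbols in the two tables above, each applied to a small made-up string. This is a sketch for checking one's understanding, not an exhaustive test of the tables.

```python
import re

# ^ and $ anchor a pattern to the start or end of the string.
starts = re.findall(r'^ab', 'abcab')
suffix = re.search(r'(ed|ing|s)$', 'walking').group()

# {m,n} bounds the number of repeats; \d matches a decimal digit.
digits = re.findall(r'\d{2,4}', 'a1 b22 c55555')

# Parentheses set the scope of + and also mark a match group.
scoped = re.fullmatch(r'a(b|c)+', 'abcb') is not None

# \1 must repeat exactly what the first group matched.
doubled = re.search(r'([A-Z][a-z]+)\1', 'a BonBon').group()

# \s matches whitespace; \w+ matches a run of alphanumeric characters.
fields = re.split(r'\s+', 'a  b\tc\nd')
words = re.findall(r'\w+', 'HG251: Language')
print(starts, suffix, digits, scoped, doubled, fields, words)
```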
HG251: Language and the Computer Francis Bond, 2011.