COR Lab 3: SQL

This lab will give you practice using SQL. We will use SQLite (e.g. using the DB Browser for SQLite) to search through the NTU Multilingual Corpus (NTU-MC). The NTU-MC is a multilingual sense-tagged corpus. You'll be provided with working databases for both English and Mandarin Chinese (but you don't need to have any command of Mandarin Chinese to complete this assignment).

Please answer each question providing both the SQL query, the retrieved answer and where necessary some discussion. Please use the eng_new.db database.

Upload the final lab report through as pdf
It should be called lab3-yoursurname.pdf

Replace the ??? with your answers.

This is a simplified explanation of the tables you'll have to use:

table: concept
sid|cid|clemma|tag|tags|comment|ntag|usrname
sid - sentence id (unique)
cid - concept id (unique only in a given sentence)
clemma - concept lemma (may be a multi-word expression)
tag - concept tag (may be a synset, or a number of other predefined tags that you don't need to know about)
tags - IGNORE
comment - IGNORE
ntag - IGNORE
usrname - IGNORE

table: word
sid|wid|word|pos|lemma|cfrom|cto|comment|usrname
sid - sentence id (unique)
wid - word id (unique only in a given sentence)
word - word surface form
pos - part-of-speech (consult http://compling.hss.ntu.edu.sg/ntumc/ for the tag sets)
cfrom - IGNORE
cto - IGNORE
comment - IGNORE
usrname - IGNORE

### not actually used 
table: cwl (concept to word links for each sentence) 
sid|wid|cid|usrname
sid - sentence id (unique)
wid - word id (unique only in a given sentence) 
cid - concept id (unique only in a given sentence)
usrname - IGNORE

Q1: (2)

What is the most frequent word in the English Corpus? And in the (Mandarin) Chinese Corpus?

???

Q2: (2)

What are the total number of words tagged respectively as Adjective, Noun, Verb and Adverb (using the Universal Part-of-Speech Tagset) for English? And for Chinese? (Please use the mappings provided by NTU-MC for English and Chinese. ).

Your answer should have eight numbers (one for each POS+Language combination) and possibly eight queries.

???

Q3: (4)

What are the five concepts with the most variety in lemmas for English and Mandarin (i.e. the synsets with the most different lemmas). Please ignore tags that are not synsets, i.e. the tag must be of the form: xxxxxxxx-p (where x are numbers and p is the pos code).

???

Look the synsets up in the corpus search and the open multilingual wordnet and look at the lemmas (in either English, Mandarin or both): why do you think these concepts have so many lemmas. Note that you can put the concept id (xxxxxxxx-p) in the search box for OMW. Some concepts are new at NTU and may not show up (those starting with 7,8 or 9).

???

Q4: (2)

Compare the frequency of usage of adjectives (using Petrov, Das and McDonald's Universal Part-of-Speech Tagset) between English and Chinese. Use the Frequency Comparison provided at http://sigil.collocations.de/wizard.html. Using the Universal Part-of-Speech Tagset and the NTU-MC as sample, can we say that any of the two languages differ (significantly) in the usage of adjectives? (Please show any new queries used, the input you give to the wizard, and paste the statistical results along with a short answer).

???


COR (Corpus Linguistics) main page.

Francis Bond <bond@ieee.org> <francis.bond@upol.cz>
Home page