NTU Multilingual Corpus (NTU-MC)
Documentation (draft)

The NTU Multilingual Corpus (NTU-MC) is a collection of parallel texts. Parts of it are sense-tagged, parts are treebanked, and parts are annotated for sentiment.

The Structure of the Corpora

Monolingual Database Schema

What was annotated when

An incomplete list
Year  Class   From-to       Corpus                          Comments
2011  hg2002                Singapore Tourist Data          website
2012  hg2002                The Cathedral and the Bazaar    essay
2013  hg2002  10000, 10598  SPEC                            (retag)
2014  hg2002  11000, 11607  DANC                            (retag)
2015  hg2002                蜘蛛の糸 [The Spider's Thread]  multilingual
2016  hg8011  55657, 56209  REDH                            A-E
2018  hg8011  50804, 51464  SCAN
2018  hg2002  45681, 46691  HOUND                           A-B
2019  hg8011  46692, 47487  HOUND                           A-D (E two only)
2019  hg2002  47488, 48504  HOUND                           A-B (with sentiment)
2020  hg2002  48505, 49505  HOUND                           A-B (C one only) (with sentiment)
2021  hg8011  18525, 18935  FINA                            A-C (one D) (with sentiment)
2021  hg2002  13147, 13968  NAVA                            A-C
              to 13973                                      done by RA (with sentiment)

References:

Canonical Citation:

Liling Tan and Francis Bond. 2012. Building and annotating the linguistically diverse NTU-MC (NTU-multilingual corpus). In International Journal of Asian Language Processing 22(4) pp 161–174.

Other References:

Francis Bond, Andrew Devadason, Melissa Rui Lin Teo, and Luís Morgado da Costa. 2021. Teaching Through Tagging — Interactive Lexical Semantics. In Proceedings of the 11th Global Wordnet Conference (GWC 2021).

Francis Bond, Shan Wang, Eshley Huini Gao, Hazel Shuwen Mok, and Jeanette Yiwen Tan. 2013. Developing parallel sense-tagged corpora with wordnets. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse (LAW 2013). Sofia. pp 149–158.

Yu Jie Seah and Francis Bond. 2014. Annotation of Pronouns in a Multilingual Corpus of Mandarin Chinese, English and Japanese. In 10th Joint ACL - ISO Workshop on Interoperable Semantic Annotation. Reykjavik.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2011. A universal part-of-speech tagset. arXiv preprint arXiv:1104.2086.

Shan Wang and Francis Bond. 2014. Building The Sense-Tagged Multilingual Parallel Corpus. In 9th Edition of the Language Resources and Evaluation Conference (LREC 2014), Reykjavik.

Francis Bond, Tomoko Ohkuma, Luis Morgado da Costa, Yasuhide Miura, Rachel Chen, Takayuki Kuribayashi, and Wenjie Wang. 2016. A multilingual sentiment corpus for Chinese, English and Japanese. In Proceedings of the LREC 2016 Workshop “Emotion and Sentiment Analysis”. Portorož. pp 59–62.



Contributors: Francis Bond, Luís Morgado da Costa, Tuan Anh Le, Michael Wayne Goodman and many more.


Francis Bond <bond@ieee.org>
Division of Linguistics and Multilingual Studies
Nanyang Technological University
Level 3, Room 55, 14 Nanyang Drive, Singapore 637332
Tel: (+65) 6592 1568; Fax: (+65) 6794 6303
The corpus is hosted on GitHub: https://github.com/bond-lab/NTUMC