NTU Multilingual Corpus (NTU-MC)
Documentation (draft)

The NTU Multilingual Corpus is a collection of parallel texts, with some sense tagged, some treebanked and some marked for sentiment.

The Structure of the Corpora

Sense Tagging Guidelines

The current version of the sense tagging guidelines.

Illustrated guide for students

Guide for tagging sentiment

Parts of Speech

English POS tags

Monolingual Database Schema

SQL Schema for the Indonesian Corpus ind.db (not showing logs)

Training-test splits

Corpora come with standard splits (train/dev/test: 60/20/20) and folds, that start on paragraph boundaries and are balanced by the number of words. More detail here.

References:

Canonical Citation:

Liling Tan and Francis Bond. 2012. Building and annotating the linguistically diverse NTU-MC (NTU-multilingual corpus). In International Journal of Asian Language Processing 22(4) pp 161–174.

Other References:

Francis Bond, Shan Wang, Eshley Huini Gao, Hazel Shuwen Mok, and Jeanette Yiwen Tan. 2013. Developing parallel sense-tagged corpora with wordnets. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse (LAW 2013). Sofia. pp 149–158.

Yu Jie Seah and Francis Bond. 2014. Annotation of Pronouns in a Multilingual Corpus of Mandarin Chinese, English and Japanese. In 10th Joint ACL - ISO Workshop on Interoperable Semantic Annotation Reykjavik.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2011. A universal part-of-speech tagset. arXiv preprint arXiv:1104.2086.

Shan Wang and Francis Bond. 2014. Building The Sense-Tagged Multilingual Parallel Corpus. In 9th Edition of the Language Resources and Evaluation Conference (LREC 2014), Reykjavik.

Francis Bond, Tomoko Ohkuma, Luis Morgado da Costa, Yasuhide Miura, Rachel Chen, Takayuki Kuribayashi, and Wenjie Wang (2016) A multilingual sentiment corpus for Chinese, English and Japanese. In Proceedings of the LREC 2016 Workshop “Emotion and Sentiment Analysis”, Portorož. pp 59–62

Contributors: Francis Bond, Luís Morgado da Costa, Tuan Anh Le, Michael Wayne Goodman and many more.

Francis Bond <bond@ieee.org>

This is hosted at github: https://github.com/bond-lab/IMI

NTU Multilingual Corpus (NTU-MC) Documentation (draft)