NTU Computational Linguistics Lab Funded Projects
This page lists a selection of funded projects that we are working on in the computational linguistics group at the Division of Linguistics and Multilingual Studies, Nanyang Technological University, Singapore.
- Integration of an Online Language Error Detection System into an
Engineering Communication Course Curriculum (PI: Roger Winder) EdEx
- Digital Mapping the Literary Epigraph: Quantitative analysis of literary influence using network theory and thousands of epigraphs
(PI: Mathews) 2017–2019 (MOE Tier 1)
Abstract …
The investigators propose to use the epigraph (the quotation positioned at the start of many novels) as a clear empirical marker of literary influence between time periods and countries. We aim to build a corpus of approximately 20,000 epigraphs and thoroughly investigate the connections within this big data set using network theory. We will explore the resulting implications by constructing a digital map of the world that demonstrates the evolution of the novel and its influences.
Research questions
- What were the key moral, philosophical, and aesthetic influences on literature through the ages?
- How do different national literatures influence one another and what does the global map of literary "soft power" look like?
- In what ways does visualising literature as a network rather than a set of discrete objects alter our understanding of the history of literature?
The project charts the benefits of openness, collaboration and the circulation of knowledge and expertise, making it highly relevant to Singapore's status as a cosmopolitan hub. It is a pioneering study in the digital humanities, blending literary studies with computational linguistics in a way that will help boost Singapore's transformation into a global media city.
The outcome of this research project will be the first digital map of literary influence. The project uses an innovative methodology to reveal connections between novels that are observable only at scale. We are also committed to making our findings accessible to both students and a non-specialist audience.
The academic significance is demonstrated by outputs ranging from academic articles and a proposed book project to an online database and open-source digital mapping software. The project is innovative, and this is an opportune moment to investigate literary influence using the latest developments in digital technology.
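The network-theoretic idea behind the project can be caricatured in a few lines of pure Python: treat each epigraph as a directed edge from the quoted author to the quoting author, then rank authors by how many distinct writers quote them. This is only a minimal sketch; the author names and the ranking function below are invented for illustration and are not drawn from the project's corpus.

```python
from collections import Counter

# Toy epigraph records: (author of the quoting novel, quoted author).
# All names are illustrative examples, not data from the project.
epigraphs = [
    ("George Eliot", "Dante"),
    ("T. S. Eliot", "Dante"),
    ("Herman Melville", "Shakespeare"),
    ("Virginia Woolf", "Shakespeare"),
    ("James Joyce", "Homer"),
    ("Mary Shelley", "Milton"),
    ("Herman Melville", "Milton"),
]

def influence_ranking(records):
    """Rank quoted authors by how many distinct authors quote them
    (their in-degree in the influence network)."""
    quoted_by = {}
    for quoting, quoted in records:
        quoted_by.setdefault(quoted, set()).add(quoting)
    counts = Counter({author: len(quoters) for author, quoters in quoted_by.items()})
    return counts.most_common()

print(influence_ranking(epigraphs))
```

With a corpus of ~20,000 epigraphs, the same edge list could be fed to a graph library to compute richer centrality measures and to drive the digital map.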
- Syntactic Well-Formedness Diagnosis and Error-Based Coaching
in Computer Assisted Language Learning using Machine Translation Technology
(PI: Francis Bond) 2016–2018 (MOE TRF)
Abstract …
We propose a new language tutoring system that provides explicit feedback on users’ errors using natural language processing technology. A student enters a sentence, and the system uses a computational grammar to check whether it is well formed. If there is an error, the system uses mal-rules to identify both the intended meaning and the error. If there are multiple possible intended meanings, it uses machine translation (MT) technology to ask which was meant, using the student’s first language. After this, it can accurately provide hints about the errors. The use of MT technology for meaning disambiguation in tutoring systems is, to the best of our knowledge, unprecedented, and it will enable systems to detect and provide feedback on many classes of grammatical errors with great confidence. Venturing guesses, as is customary in such systems, can confuse students, especially if the proposed correction has a different meaning from the one the student intended.
We propose to build a Bilingual Online Language Tutor prototype focusing on early Mandarin Chinese Second Language (L2) learners using English as their source language. Our system will also produce, from the users’ interaction, a Learner Corpus, annotated automatically with syntactic and semantic information.
Ultimately, we aim to prove the viability of a new integrated rule-based MT approach to disambiguating students’ intended meaning in computer-assisted language teaching. This is a necessary step towards providing accurate coaching on how to correct ungrammatical input, and it will allow us to overcome the current bottleneck in Computer Assisted Language Learning (CALL) – an exponential burst of ambiguity caused by ambiguous lexical items (Flickinger, 2010). Ambiguity makes it hard to coach a student without confirming the intended meaning behind a malformed input, especially when there are many possible corrections. Our system will be the first to use MT technology to confirm the student’s intended meaning with them. With this information, it can more precisely pinpoint the error and offer constructive grammatical coaching.
More technically, our proposal integrates precise syntactic parsers and semantics-based MT to leverage information across languages. We will integrate results from surveying the usage of lexical items and syntactic structures, along with the most common writing mistakes made by Chinese (L2) learners. This will allow us to program a grammar that selectively accepts ungrammatical sentences, marking them as ungrammatical. This special grammar can be used both to identify grammar errors and to reconstruct the semantics of ungrammatical inputs (Bender et al., 2004), which can then be used by the MT component to enable source-language interaction and feedback.
Our prototype will be evaluated by NTU undergraduate students of Mandarin Chinese (L2) in the Centre for Modern Languages (HSS). Our goal is to build a system by the end of the project that is usable for early language learners, and furthermore is extensible to higher levels and new languages.
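The mal-rule diagnosis step described above can be caricatured with a toy pattern lookup. The error patterns, diagnosis labels and hints below are invented for illustration only; the project's actual system uses a full computational grammar, not naive substring matching.

```python
# Toy mal-rule table: each entry maps a known learner-error pattern
# to a diagnosis and a hint. Entirely illustrative, not the project's rules.
MAL_RULES = [
    ("i very like", "adverb-verb order error",
     "In English, 'very' cannot modify 'like' directly; try 'I really like'."),
    ("he go ", "subject-verb agreement error",
     "Third-person singular subjects need '-s': 'He goes'."),
]

def diagnose(sentence):
    """Return (diagnosis, hint) for the first matching mal-rule,
    or None if no known error pattern is found.

    Substring matching is deliberately naive here; a real system
    parses the sentence with a grammar extended by mal-rules."""
    lowered = sentence.lower()
    for pattern, diagnosis, hint in MAL_RULES:
        if pattern in lowered:
            return diagnosis, hint
    return None

print(diagnose("I very like this book"))
```

In the proposed system the match would instead come from parsing with a grammar whose mal-rules also recover the intended semantics, so that MT can render the candidate meanings in the student's first language for confirmation.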
- Acquisition of Physical Action Verbs by Bilingual Singaporean Preschoolers
(PI: Helena Gao) MOE Tier 2
- Semi-automatic implementation of clinical practice guidelines in Singapore hospitals
(PIs: Jung-Jae Kim, Francis Bond) MOE Tier 1
Abstract …
Electronic medical record (EMR) systems have been shown in many studies to improve healthcare quality. In Singapore, there is steady adoption of EMRs by local healthcare providers, particularly in large institutions such as the restructured hospitals. In these hospitals, healthcare professionals create, update and maintain a trail of paper and electronic clinical documentation such as clinical notes, laboratory tests, prescriptions, and discharge summaries for every patient. The availability of patient records in electronic form and in large quantities opens the door to analyzing their content, both to help doctors in decision-making and for knowledge discovery.
As such, many local restructured hospitals are now trying to move to a completely paperless documentation system so as to reduce the duplication of paper and electronic data inputs. However, there are several challenges:
- Many EMR vendors are stuck in past paradigms of structured data entry, and merely translate physical paper forms into electronic format without improvement in speed and usability.
- The fixed structured clinical forms do not allow free expression and, with their limited vocabulary, often frustrate clinicians in ways that paper does not.
Numerous studies show that clinician users are extremely concerned about EMR usability and system performance, which affect their satisfaction with and adoption of the systems. In fact, the old paradigm of form-based data entry has led to widespread user resistance to EMRs and electronic documentation.
In the current paradigm of EMR systems, we find the following problems:
- Paper note entry and electronic data entry co-exist and frequently duplicate each other.
- The electronic user interfaces are not user-friendly and are time-consuming, owing to the great number of structured forms to be filled in and options to be considered.
- The paper note entry and the freestyle text entry of EMR systems are not aligned with the underlying semantic representation of the structured forms.
These problems lead us to propose a new platform for the electronic clinical documentation interface. It assumes that the patient’s medical records are written as free-style electronic documents (i.e. in natural language such as English) and that, instead of structured data entry, the free text will be parsed and its semantics encoded in real time to populate the structured forms, for example by extracting diagnoses and lab orders.
Natural language processing (NLP) techniques such as part-of-speech tagging and parsing can help in understanding the meaning of the free text of patient records. However, available NLP solutions trained on text such as newspaper articles and research papers cannot be directly applied to these records, because Singaporean EMRs are full of incomplete sentences and non-standard abbreviations, some of which even other doctors may not confidently understand. For similar reasons, existing EMR solutions with NLP components developed in other English-speaking countries such as the US cannot be directly applied to Singaporean EMRs, since the records targeted by those solutions usually consist of full, grammatical sentences. Thus there is a need to develop a new NLP solution for Singaporean EMRs.
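As a minimal sketch of the kind of processing envisaged, the toy function below expands clinical abbreviations and pulls diagnosis mentions out of a telegraphic note. The abbreviation table is invented for illustration; a real solution would need a clinician-curated abbreviation inventory for Singaporean EMRs and proper parsing rather than token lookup.

```python
import re

# Illustrative abbreviation table. Real Singaporean EMR abbreviations
# would have to be collected locally, not hard-coded like this.
ABBREV = {
    "htn": "hypertension",
    "dm": "diabetes mellitus",
    "sob": "shortness of breath",
}

def extract_diagnoses(note):
    """Pull diagnosis mentions out of a telegraphic clinical note,
    expanding known abbreviations. Tokenises on alphabetic runs so
    that forms like 'HTN/DM' split into separate tokens."""
    tokens = re.findall(r"[a-z]+", note.lower())
    return [ABBREV[t] for t in tokens if t in ABBREV]

note = "55yo M, known HTN/DM, c/o SOB x 2 days"
print(extract_diagnoses(note))
```

The extracted concepts could then populate the structured diagnosis fields in real time, as the proposal describes, without forcing the clinician through form-based data entry.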
- Multilingual Semantic Analysis
(PI: Francis Bond, Tomoko Ohkuma) 2014–2018 Joint research with Fuji-Xerox
- That’s what you meant: A Rich Representation for Manipulating Meaning (PI: Bond)
Abstract …
There are two main approaches to the study of meaning in text. One looks at the relations between words when they are used (structural semantics). For example, in the sentence "The head fired the driver", there is an act of firing, where the boss is the one who fires the driver. A typical (simplified) representation of this would be [fire (head, driver)]. Another approach is to consider the meaning of each word in relation to our knowledge of other words (lexical semantics): in this sentence head means something similar to "boss" rather than being a "body part"; fire means to "terminate employment" not to "discharge a projectile"; and driver is a "person who drives a vehicle", not a "golf club" or "piece of software". A typical (simplified) representation of this would be [The head3 fired2 the driver1], where the subscripts refer to a definition of the meaning in some external knowledge source.
Our goal is to unite these two approaches (structural and lexical) in an integrated semantic framework, in order to be able to study the interactions between the two kinds of information and to better model language computationally. In our integrated representation, the final representation will be something like: [fire2 (head3, driver1)].
More technically, we are integrating minimal recursion semantic representations with wordnet senses, extending both as necessary. We will look at three languages: Chinese, English and Japanese, using and extending existing resources.
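The integrated representation can be sketched in a few lines of Python: a predicate-argument structure (the structural side) whose every symbol carries a sense identifier (the lexical side). The Sense and Predication classes and the sense numbers are illustrative stand-ins for real MRS predications and wordnet sense keys, not the project's actual data structures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sense:
    """A word paired with a pointer into an external knowledge source,
    e.g. a wordnet sense (the lexical-semantic half)."""
    lemma: str
    sense_id: int  # illustrative sense number, not a real wordnet key

    def __str__(self):
        return f"{self.lemma}{self.sense_id}"

@dataclass(frozen=True)
class Predication:
    """A predicate applied to arguments in role order
    (the structural-semantic half)."""
    predicate: Sense
    args: tuple

    def __str__(self):
        return f"{self.predicate}({', '.join(map(str, self.args))})"

# "The head fired the driver", with both structure and word senses:
rep = Predication(Sense("fire", 2), (Sense("head", 3), Sense("driver", 1)))
print(rep)  # fire2(head3, driver1)
```

Keeping the two halves in one object is what lets the interactions between structural and lexical information be studied directly, as the abstract describes.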
- Grammar Matrix Reloaded: Syntax and Semantics of Affectedness
(PI: Kratochvíl) MOE Tier 2
- A formal syntactic model for incremental parsing with unification (PI: Bond)
- Assessing the Effect of License Choice on the Use of Lexical Resources (PI: Bond)
- Automatically determining meaning by comparing a text to its translation:
Comparing structural and lexical semantics using aligned text (PI: Bond)
- Equivalent but Different: how languages represent meaning in different ways (PI: Bond)
- Revealing Meaning Using Multiple Languages (PI: Bond)
- Shifted in Translation: An Empirical Study of Meaning Change across Languages (PI: Bond)
Listed in reverse chronological order of start of grant (newest first).
Un-Funded Projects
Last modified: 2020-01-02