This is a short focused workshop with two themes: improving the
Wordnet Bahasa and building wordnets for other South East Asian languages.
The first workshop/hackathon was held in 2014.
Date | Time | Session |
Fri (15) |
10:00-10:15 | Francis Bond (NTU)
Introduction to the workshop
|
10:15-11:00 | Christiane Fellbaum (Princeton)
When and how to add new synsets to WordNet?
|
11:00-11:30 | Francis Bond (NTU)
Linking across languages, linking across cultures
|
11:30-12:30 | Lunch |
12:30-13:00 | David Moeljadi (NTU)
How to deal with affixes in Wordnet Bahasa?
Abstract:
The Indonesian language has a rich affixation system, including a variety of prefixes, suffixes, circumfixes, clitics, and reduplication. Should we include all these bound morphemes in Wordnet Bahasa? Which ones should be added, which should not, and why?
|
13:00-13:30 | Dora Amalia (Badan Bahasa)
Lexicography challenges in minority languages in Indonesia
Abstract:
The most prominent feature of minority languages is that they have no written texts. Indonesia has hundreds of minority languages, and documenting them is a huge lexicographic challenge. Dictionary-making strategies for major languages and minority languages are very different. Dictionary compilation for minority languages usually begins with field research, while for major languages it can rely on abundant corpora. The purpose of compiling the dictionary differs as well: commercial considerations are closely tied to dictionary compilation for major languages, whereas for minority languages the work is more idealistic and scientific. This paper discusses and compares various widely used strategies for dictionary making in field lexicography. The comparison is expected to yield an alternative strategy suited to language conditions in Indonesia.
|
13:30-14:00 | Totok Suhardijanto (UI)
Creating Javanese WordNet: between challenge and opportunity
Abstract:
This paper outlines challenges and opportunities in creating a Javanese WordNet as part of the WordNet Bahasa project to expand its language inventory. Javanese is one of the major languages of the Southeast Asian region, with a total of 80 million speakers. Despite the differences among its various sub-dialects, Javanese is basically divided into three main groups: Western Javanese, Central Javanese, and Eastern Javanese. In common with other Austronesian languages, there are three styles or registers in Javanese: ngoko, madya, and krama. Based on the latest version of Bausastra Jawa, Javanese has more than 36,000 headwords. When creating a WordNet for a new language, there are three approaches to choose from. First, create a WordNet from scratch. Second, translate an existing WordNet. Third, use a top ontology and extend it with a local synonym dictionary. This paper reviews these three options in order to choose the approach that best suits a Javanese WordNet.
|
14:00-14:30 | Tea break |
| ↓ Move to HSS Seminar Room 9 (HSS-B1-11) ↓ |
14:30-15:00 | Lim Lian Tze (Malaysia)
Compiling a multilingual biodiversity name register from monolingual dictionaries, Wordnets and Wikidata
Abstract:
The Malay language has an abundance of names for flora and fauna, as befitting the rich biodiversity of the region where it is spoken. We take a quick look at how Wordnet Bahasa can benefit from this nomenclature, by aligning data from monolingual dictionaries, existing wordnets, and Wikidata.
|
15:00-15:30 | Virach Sornlertlamvanich (Sirindhorn)
Asian WordNet: web service and the collaborative platform
Abstract:
Developing and maintaining an ontology database is costly and time-consuming. Once the ontology is established, we also expect it to serve tasks that require cross-language concept mapping. WordNet, with its power to express senses as synsets (sets of synonyms), facilitates computational expression very well. We took advantage of this representation of a sense as a list of words, the so-called synset, to provide a common platform for collaborative WordNet construction for a language. To prepare an initial WordNet for a given language, we align synsets to lists of words from existing bilingual dictionaries. The degree of confidence in a synset assignment is computed from the distance between a word and a member of the synset. Word synonyms are also used to find candidate synsets. As a result, a list of candidate synsets is proposed for each word entry, together with a confidence score. In our approach, we show the efficiency of nominating synset candidates using the most common lexical information. The algorithm is evaluated on Thai-English, Indonesian-English, and Mongolian-English bilingual dictionaries. The experiment also shows the effectiveness of using the same type of dictionary from different sources. To demonstrate cross-language access to the WordNet, we use the synset in the Princeton WordNet (PWN) as a key to retrieve a set of words in the target language. Moreover, the environment for developing WordNets for Asian languages is designed in a distributed manner. Each language can maintain its own environment and share its resulting WordNet through a web service protocol (SOAP).
Currently, Asian WordNet (AWN) serves the following languages, with coverage depending on the progress of contributions: Bengali (0.90%), Indonesian (8.17%), Japanese (30.35%), Korean (35.93%), Lao (33.05%), Mongolian (1.38%), Burmese (16.95%), Nepali (0.03%), Sinhala (0.23%), Sundanese (0.06%), Thai (40.27%), and Vietnamese (10.40%).
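The idea of proposing candidate synsets with a confidence score can be illustrated with a minimal Python sketch. It is not the AWN algorithm: the abstract does not give the distance formula, so the sketch assumes a simple Jaccard overlap between a word's English translations (from a bilingual dictionary) and the members of each synset. The function name `candidate_synsets`, the synset ids `s1` to `s3`, and the toy data are all hypothetical.

```python
def candidate_synsets(translations, synsets):
    """Rank synsets for one dictionary entry.

    translations: English translations of the target-language word.
    synsets: dict mapping a synset id to its set of English lemmas.
    The score is a Jaccard overlap, an assumed stand-in for the
    confidence measure mentioned in the abstract.
    """
    trans = set(translations)
    scored = []
    for synset_id, lemmas in synsets.items():
        overlap = trans & lemmas
        if overlap:
            score = len(overlap) / len(trans | lemmas)
            scored.append((synset_id, score))
    # Highest score first; ties broken by synset id for stable output.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy data (hypothetical ids and entries, for illustration only).
toy_synsets = {
    "s1": {"dog", "domestic_dog", "canis_familiaris"},
    "s2": {"hound", "hound_dog"},
    "s3": {"cat"},
}
ranked = candidate_synsets(["dog", "hound"], toy_synsets)
for synset_id, score in ranked:
    print(synset_id, round(score, 3))
```

Synsets that share no member with the translations are dropped, so a lexicographer would only review the ranked candidates.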
|
15:30-16:15 | Break |
16:15-16:45 | Wang Wenjie (NTU)
Sentiment tagging of NTU-MC
Abstract:
In this session, we will introduce the current tools used to annotate sentiment in the NTU-MC, for the English, Chinese and Japanese texts of The Speckled Band. We will look at both concept-level and “chunk”-/phrase-level annotation, the processes behind the annotation, and certain phenomena of interest we came across, particularly inter-lingual differences.
|
16:45-17:15 | Kuribayashi Takayuki (NTU)
Orthographic variations in the Japanese Wordnet
Abstract:
A word in Japanese usually has more than one way to be written. However, the Japanese Wordnet 1.1 does not cover the abundance of orthographic variation. We therefore have been adding the orthographic variants which were obtained from other open-licensed dictionaries. At the same time, we are grouping both the existing and new variants. Through these attempts, we can increase the coverage of the Japanese Wordnet and organize synonyms.
|
17:15-18:00 | Discussion: orthographic issues in Wordnet |
18:00-21:00 | Dinner + Excursion: Lau Pa Sat → Marina Bay → Gardens By The Bay |
Sat (16) |
10:00-10:30 | Luís Morgado da Costa (NTU)
Extending Wordnet: The Never-ending Story...
Abstract:
Motivated by the ongoing semantic annotation of the NTU-Multilingual Corpus, we will present some ongoing efforts to expand the depth and breadth of the Open Multilingual Wordnet’s coverage by introducing interjections and classifiers as new classes of concepts in Wordnet.
|
10:30-11:00 | František Kratochvíl and Luís Morgado da Costa (NTU)
Extending lexical resources for Abui, a Papuan language
Abstract:
We conducted two lexicographic sessions in the Abui community in 2013 and 2014 using the Rapid Words methodology, which engages community members in crowd-sourced rapid documentation of the lexicon. We collected about 16 thousand words over about 8 days of work, with 25 participants. The data are hand-written wordlists linked to the Semantic Domain ontology. About 7 thousand words are currently digitized and provided with an Indonesian translation. In our paper we discuss the linking process, which uses a mapping between WordNet and Semantic Domains (Muhammad Zulhelmy et al. 2014) to add Indonesian senses, English glosses and definitions. In the next stage of the project, these can be checked at the next lexicographic workshop.
|
11:00-11:30 | Jack Halpern (CJK Dictionary Institute)
Major issues in compiling multilingual Kanji dictionaries
Abstract:
To satisfy the urgent need for high-quality Japanese study materials in non-English languages, The CJK Dictionary Institute has embarked on a project to create a series of foreign language editions of The Kodansha Kanji Learner's Dictionary (KKLD), a popular kanji learner’s dictionary that has become a standard reference work in Japanese language education. The Kanji Learner's Dictionary Series (KLD) is an ongoing project to translate a subset of KKLD into some of the major languages of the world. The aim of this presentation is to examine several key issues in pedagogical lexicography both from the lexicographer's and from the kanji learner's points of view, focusing on compilation and design innovations that increase learner usability. This presentation focuses on several key issues that directly impact the pedagogical efficacy of kanji dictionaries. Firstly, sense division and sense ordering must take into account the complex interlingual equivalences between the source language and target language. For logographic systems like Chinese and Japanese, logico-semantic ordering achieves the highest pedagogical benefit. Secondly, Chinese characters form a network of interrelated parts that function as an integrated system. This presentation describes compilation techniques for presenting senses in a manner that makes the semantic transparency and morphological productivity of each character clear. Thirdly, a valuable feature is the core meaning, a concise keyword that defines the most dominant meaning of each character. Fourthly, the System of Kanji Indexing by Patterns (SKIP) enables users to locate characters as quickly and as accurately as in alphabetical dictionaries. Thanks to these features, Japanese learners around the globe have quick access, in their own language, to a wealth of information on kanji that is linguistically accurate, easy to use, and carefully adapted to their practical needs.
|
11:30-12:30 | Lunch |
12:30-13:00 | Lim Lian Tze (Malaysia)
A day (or two) with lexicographers
Abstract:
We've had the chance to work with lexicographers from the Division of Dictionary, Terminology and Lexicography, Dewan Bahasa & Pustaka (DBP), Malaysia. In this talk, we share our experiences from a one-day hands-on workshop on editing Wordnet Bahasa entries, as well as other discussions.
|
13:00-13:30 | Rusli bin Abd Ghani (DBP Malaysia)
Malay lexicography & terminology: a treasure trove of lexical & cultural information
Abstract:
This paper will consist of a brief description of Malay lexicography and terminology from a historical perspective and a detailed discussion of a Malay monolingual dictionary, Kamus Dewan, first published in 1970, with particular attention to the logical structures of the dictionary and the definition styles used in describing meanings. Other lexical resources such as Malay technical term dictionaries and bilingual dictionaries (especially Malay-English and English-Malay dictionaries) will also be discussed.
|
13:30-14:00 | Ruli Manurung (UI)
Developing an Indonesian Treebank
Abstract:
In this talk we will present our efforts in developing an Indonesian Treebank. Building on our previous work concerning the development of an Indonesian part-of-speech tagged corpus consisting of 10,000 sentences, we set about trying to develop a treebank based on this corpus. Using the existing POS tags and the Penn Treebank bracketing guidelines as a starting point, an incremental process was adopted whereby 3 annotators bracketed the first 100 sentences. During this initial process, the bracketing guidelines and web-based annotation tool were developed. After several iterations, the process was refined and eventually applied to the first 1000 sentences, which have now been released under a Creative Commons license at http://bahasa.cs.ui.ac.id/treebank/corpus. We will also present initial work on training a statistical parser using this Treebank, and how this parser is being used in an ongoing text mining project.
|
14:00-15:00 | Tea break |
15:00-15:30 | Elvira Nurfadhilah (BPPT)
Solving ambiguities in Indonesian words by morphological analysis using Minimum Connectivity Cost
Abstract:
The Indonesian language (Bahasa Indonesia) has a number of uncommon characteristics, such as a large number of derivational affixes. There are so many combinations of affixes and stems in Bahasa Indonesia that ambiguities often arise. Recording every word form in a word dictionary is almost impossible, because it would make the dictionary huge and processing time very long. We propose a method to analyze the morphology of Indonesian words by using part-of-speech (POS) tagged data, an affix rule table and minimum connectivity costs to solve the problems mentioned above. Experiments showed that our system achieved a good analysis result (more than 97% accuracy).
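As a rough illustration of choosing an analysis by minimum connectivity cost, the Python sketch below scores every segmentation of a word as the sum of morpheme costs plus the costs of connecting adjacent tags, and keeps the cheapest one. This is an assumed, generic lattice formulation, not the system described in the abstract: the function `best_segmentation`, the tag set, and every cost value are hypothetical, and a real system would presumably derive them from the POS-tagged data and the affix rule table.

```python
def best_segmentation(word, lexicon, conn_cost):
    """Return (total_cost, path) for the cheapest analysis of `word`.

    lexicon: surface string -> list of (tag, morpheme_cost).
    conn_cost: (previous_tag, tag) -> connection cost; a missing pair
    means the two tags may not be adjacent.
    Returns None if no analysis covers the whole word.
    """
    n = len(word)
    # best[i][tag] = (cost, path) of the cheapest analysis of word[:i]
    # whose last morpheme carries `tag`.
    best = [dict() for _ in range(n + 1)]
    best[0]["BOS"] = (0.0, [])
    for i in range(n):
        for prev_tag, (cost, path) in best[i].items():
            for j in range(i + 1, n + 1):
                for tag, morpheme_cost in lexicon.get(word[i:j], []):
                    link = conn_cost.get((prev_tag, tag))
                    if link is None:
                        continue
                    total = cost + morpheme_cost + link
                    if tag not in best[j] or total < best[j][tag][0]:
                        best[j][tag] = (total, path + [(word[i:j], tag)])
    if not best[n]:
        return None
    return min(best[n].values(), key=lambda entry: entry[0])


# Hypothetical costs for the ambiguous form "beruang":
# "beruang" (bear), "ber" + "uang" (to have money), "be" + "ruang" (to have room).
toy_lexicon = {
    "beruang": [("NOUN", 4.0)],
    "ber": [("PREFIX", 1.0)],
    "be": [("PREFIX", 1.5)],
    "uang": [("NOUN", 2.0)],
    "ruang": [("NOUN", 2.5)],
}
toy_conn = {
    ("BOS", "NOUN"): 1.0,
    ("BOS", "PREFIX"): 0.5,
    ("PREFIX", "NOUN"): 0.5,
}
cost, path = best_segmentation("beruang", toy_lexicon, toy_conn)
print(cost, path)
```

With these made-up costs the prefixed reading wins. Changing the cost of the unsegmented noun is enough to flip the result, which is why the cost tables matter more than the search itself.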
|
15:30-16:00 | Sudha Bhingardive (IIT Bombay)
Most frequent sense determination using deep learning
Abstract:
An acid test for any new Word Sense Disambiguation (WSD) algorithm is its performance against the Most Frequent Sense (MFS). The field of WSD has found the MFS baseline very hard to beat. Clearly, if WSD researchers had access to MFS values, their striving to better this heuristic would push the WSD frontier. However, getting MFS values requires enormous amounts of sense-annotated text, which is out of reach for most languages, even if their WordNets are available. In this paper, we propose an unsupervised method for MFS detection from untagged corpora, which exploits word embeddings. We compare the word embedding of a word with all its sense embeddings and obtain the predominant sense with the highest similarity. We observe a significant performance gain for Hindi WSD over the WordNet First Sense (WFS) baseline. As for English, the SemCor baseline is bettered for those words whose frequency is greater than 2. Our approach is language and domain independent. (work done with PhD students Sudha, Dhiren and Rudra)
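The selection step described above, comparing a word's embedding with each of its sense embeddings and taking the most similar one, can be sketched in a few lines of Python. The sketch assumes cosine similarity and takes the sense embeddings as given, since the abstract does not say how they are built. The function names, sense labels, and three-dimensional vectors are hypothetical toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def predominant_sense(word_vec, sense_vecs):
    """Return the sense label whose embedding is most similar to word_vec."""
    return max(sense_vecs, key=lambda sense: cosine(word_vec, sense_vecs[sense]))


# Toy vectors (hypothetical): the word vector lies close to the finance sense.
word_vec = [0.9, 0.1, 0.0]
sense_vecs = {
    "bank%river": [0.1, 0.9, 0.2],
    "bank%finance": [0.8, 0.2, 0.1],
}
print(predominant_sense(word_vec, sense_vecs))
```

Because the word vector is trained on untagged text, it tends to sit nearest the sense that dominates the corpus, which is the intuition behind using the best match as the MFS.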
|
16:00-17:00 | Discussion: improving the Wordnet Bahasa and building wordnets for other South East Asian languages |
20:00- | Stand-up comedy show "Kilo Laughs with Joanna Sio" at Kilo Lounge |