The Development of Indonesian POS Tagging System for Computer-aided Independent Language Learning

Abstract. Word processing tools are a basic need in language learning. One tool that language learners need is a part of speech (POS) tagger. While many POS tagging tools for Indonesian have been developed, none has been designed specifically for language learners. This paper presents a study on an Indonesian POS tagging system developed as a word processing tool for language learners. We use resources from previous Indonesian POS tagging research: MorphInd for morphological analysis and IPOSTagger for part of speech tagging. Objective and subjective tests are employed to evaluate the system. In the objective test, the tagging results of a system model that combines IPOSTagger with MorphInd as the morphological analyzer are compared with the results produced by the original IPOSTagger system models. The results show that the tagging accuracy of the combined model is higher than that of the other models. For the subjective evaluation, the Mean Opinion Score (MOS) is collected from 24 participating respondents. The MOS results are 3.61 for test-1, 3.87 for test-2, and 3.72 for test-3. From these results, we expect that this POS tagging system could be used to help language learners in their Indonesian self-learning process.


Introduction
The rapid development of Natural Language Processing (NLP) tools has led to a growing need for word processing applications that support language learning. Previous studies suggest that learning a language with a computer aid such as a word processor can improve learners' proficiency significantly more than conventional methods [1]. A word processing tool can provide learners with important information about a language, such as its morphology, syntax, phonology, semantics and etymology, which helps them understand and use the language well [2]. Such tools are even more important for learners of a foreign language, who are likely to find it hard because of interference from their mother tongue. Word processing tools have become a basic need in learning a language because they support learners in important aspects, namely writing, translation, and spelling.
One word processing tool that could be helpful for language learners is a part of speech (POS) tagger. Automatic part of speech tagging is needed because manual tagging takes too much time and requires language experts. The large number of ambiguous words, i.e. words that take different POS tags in different contexts, and the complicated morphology of the language are further obstacles for learners in POS tagging.
In previous studies, several POS tagger models for Indonesian have been developed using various approaches, including statistical [3][4][5], rule-based [6] and transformation-based learning approaches [7]. Although many POS tagging tools for Indonesian have been studied and developed, none has been made widely available to a general audience. Access to these tools is therefore the main issue for language learners.
This study focuses on developing an Indonesian POS tagging system for computer-aided self-learning, aimed at language learners. The tool could also be used as a word processing tool to support language learning in the classroom. To make it easier to use, the system takes the form of a desktop application that can be installed and operated on the learner's personal computer. A POS tagging system for self-learning should tag parts of speech with high accuracy, so we select the best-performing method from previous research on Indonesian POS tagging. A statistical POS tagger based on the Hidden Markov Model (HMM) is selected to deal with ambiguity since it gives higher accuracy and fast processing time [8]. In addition, an Indonesian morphological analyzer is applied to help reduce tagging errors on unknown (out-of-vocabulary, OOV) words during the tagging process [9].
The existing resources used to build this system are MorphInd and IPOSTagger [4] [10]. MorphInd is an Indonesian morphological analyzer, and IPOSTagger is an Indonesian POS tagger that applies a statistical approach using HMM. While IPOSTagger already applies methods such as Affix tree, Succeeding POS tag and Lexicon from KBBI-Kateglo, we use MorphInd as the morphological analyzer for OOV words since it can deal with morphological phenomena such as affixation, reduplication and cliticization (proclitic and enclitic).
We combine both systems into a sub-system of a self-learning system for Indonesian POS tagging. Finally, a graphical user interface (GUI) is added to make the system easier to use.

Indonesian Part of Speech Tagging
POS tagging is the task of assigning a part of speech (POS) tag to each word in an input sentence. A part of speech, also called a syntactic category, provides information about a word and the words around it. This information can be used to determine a word's role in the syntactic structure, which makes POS tagging an important part of syntactic parsing. POS tagging is also used as a feature in named entity recognition and information extraction [11]. In addition to being an important component of Natural Language Processing (NLP) tools, POS tagging can also be used as an aid for language learners.
A POS tagging system can be used by language learners to discover the part of speech tag of each word in a sentence. Learners only need to input the sentence they want tagged, and the system returns the result immediately. The part of speech information they obtain can then be used directly in their language learning.
Developing the Indonesian POS tagging self-learning system requires several resources: an Indonesian POS corpus, a tagset and a POS tagging sub-system.

Indonesian Data Preparation
The supporting data used for the Indonesian POS tagging system include a tagset and a POS corpus for Indonesian.
Tagset. A tagset is the set of POS tags that may be assigned to a word. Several tagsets for Indonesian POS corpora have been developed previously. For our system, the tagset is a modification of the tagsets used in previous studies [3] [4] and contains 31 POS tag types. Table 1 shows the tagset used by this Indonesian POS tagging system.
Indonesian POS Corpus. For the Indonesian POS corpus, we use the Dinakaramani corpus, in which 250,000 tokens are manually labeled using 23 tag types [12]. We take around 10,000 tokens from this corpus and re-tag them with the tagset in Table 1. The resulting corpus is then used to train the model before the POS tagging process.

Building Indonesian POS Tagging Sub-system
Once the supporting data are complete, we build an Indonesian POS tagging sub-system, which serves as the backend. Its main processes are morphological analysis and tagging. Preprocessing, in the form of sentence segmentation and tokenization, is added before these main processes are executed. Figure 1 shows the processes in this sub-system. The sub-system takes two types of input: a training corpus and the sentence to be POS tagged. In the tagging process, the training corpus is used to model words and their POS tags with the HMM algorithm. The sentence to be tagged passes through the subsequent processes.
Preprocessing. In preprocessing, the system performs tokenization and sentence detection. Tokenization separates words from punctuation. This is done because punctuation marks also receive POS tags, including commas, quotation marks, and sentence terminators such as the period and the question mark.
Sentence detection is performed because the system tags one sentence at a time. Thus, if the input contains more than one sentence, the system detects the sentence terminators and tags each sentence separately.
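The preprocessing step described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the regular expression, the function names and the choice to keep hyphenated reduplications such as "anak-anak" as one token are assumptions made for illustration, and the terminator set contains only the two marks named in the text.

```python
import re

# Sentence terminators named in the text; other marks could be added.
SENTENCE_TERMINATORS = {".", "?"}

# Assumed token pattern: a run of word characters, optionally joined by
# hyphens (so a reduplication like "anak-anak" stays whole), or a single
# punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def tokenize(text):
    """Split raw text into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def split_sentences(tokens):
    """Group tokens into sentences, closing a sentence at each terminator."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in SENTENCE_TERMINATORS:
            sentences.append(current)
            current = []
    if current:  # trailing tokens that are not followed by a terminator
        sentences.append(current)
    return sentences


def preprocess(text):
    """Tokenize the input and return one token list per sentence."""
    return split_sentences(tokenize(text))


if __name__ == "__main__":
    for sentence in preprocess("Saya makan nasi. Apakah kamu lapar?"):
        print(sentence)
```

Each token list returned by `preprocess` would then be passed to the morphological analysis and tagging processes one sentence at a time.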
Morphological Analysis. The Indonesian morphological analyzer MorphInd is used for the morphological analysis process. MorphInd is a finite state-based Indonesian morphological analyzer [10] that is compiled with the Foma toolkit. It analyzes every token in a sentence on a unigram basis, i.e. independently of the surrounding tokens. Its output consists of morphemic segmentation, lemma morpheme position, lexical category, and morphological features. Figure 2 shows an example of MorphInd input and output.

Fig. 2. MorphInd input and output example
We take only the lexical category of each token from this output. However, since the lexical categories (tagset) of MorphInd differ from the tagset applied here, they are converted into the tagset shown in Table 1.
Tagging. The tagging process uses the IPOSTagger system. We apply two tagging model configurations as basic models: HMM bigram and HMM trigram. IPOSTagger also applies smoothing methods to deal with data sparsity, namely Jelinek-Mercer smoothing and linear interpolation smoothing. Figure 3 shows the tagging workflow in this sub-system.
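To make the idea of an HMM bigram model with interpolation smoothing concrete, the following is a small sketch under stated assumptions. It is not IPOSTagger's code: the toy corpus, the tag names (PRP, VB, NN, MD, JJ), the interpolation weight `lam=0.8` and all function names are invented for illustration, only the transition probabilities are smoothed, and plain probabilities are used instead of log probabilities because the example sentences are short. The trigram configuration is omitted for brevity.

```python
from collections import Counter

START = "<s>"  # pseudo-tag marking the beginning of a sentence


def train(tagged_sentences):
    """Count tags, tag bigrams and (tag, word) emissions in a tagged corpus."""
    tag_counts, context_counts = Counter(), Counter()
    bigram_counts, emit_counts = Counter(), Counter()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            tag_counts[tag] += 1
            context_counts[prev] += 1
            bigram_counts[(prev, tag)] += 1
            emit_counts[(tag, word.lower())] += 1
            prev = tag
    return tag_counts, context_counts, bigram_counts, emit_counts


def transition(prev, tag, model, lam):
    """Interpolated estimate: lam * P(tag | prev) + (1 - lam) * P(tag)."""
    tag_counts, context_counts, bigram_counts, _ = model
    unigram = tag_counts[tag] / sum(tag_counts.values())
    seen = context_counts[prev]
    bigram = bigram_counts[(prev, tag)] / seen if seen else 0.0
    return lam * bigram + (1 - lam) * unigram


def viterbi(words, model, lam=0.8):
    """Return the most probable tag sequence for a non-empty list of
    in-vocabulary words (OOV handling is outside this sketch)."""
    tag_counts, _, _, emit_counts = model
    tags = sorted(tag_counts)

    def emission(tag, word):
        return emit_counts[(tag, word.lower())] / tag_counts[tag]

    # best[i][tag] = (probability of the best path ending in tag, previous tag)
    best = [{t: (transition(START, t, model, lam) * emission(t, words[0]), None)
             for t in tags}]
    for word in words[1:]:
        column = {}
        for tag in tags:
            column[tag] = max(
                (best[-1][prev][0] * transition(prev, tag, model, lam)
                 * emission(tag, word), prev)
                for prev in tags
            )
        best.append(column)

    # Backtrack from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Invented toy corpus; "bisa" is ambiguous ("can" as MD, "venom" as NN).
TOY_CORPUS = [
    [("saya", "PRP"), ("makan", "VB"), ("nasi", "NN")],
    [("dia", "PRP"), ("minum", "VB"), ("air", "NN")],
    [("saya", "PRP"), ("bisa", "MD"), ("makan", "VB")],
    [("bisa", "NN"), ("ular", "NN"), ("berbahaya", "JJ")],
]

if __name__ == "__main__":
    model = train(TOY_CORPUS)
    print(viterbi(["dia", "bisa", "minum"], model))
    print(viterbi(["bisa", "ular", "berbahaya"], model))
```

On this toy corpus the ambiguous word "bisa" is tagged MD after a pronoun and NN at the start of "bisa ular berbahaya", which shows how the tag context resolves ambiguity. The interpolation keeps unseen tag bigrams from receiving zero probability.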

Fig. 3. Sub-system tagging process
The input in Figure 3 is the token sequence produced by preprocessing. Each word in this sequence is checked against the training corpus to determine whether it is an OOV word. If it is not an OOV word, tagging is performed immediately; if it is an OOV word, the POS tag assigned to it is taken from the preceding morphological analysis process.
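The OOV decision in this workflow can be sketched as follows. This is only an illustration of the control flow: `tag_sentence`, `toy_hmm_tag` and `toy_morph_tag` are hypothetical names, the two toy functions are stand-ins for IPOSTagger and for MorphInd plus the tagset conversion (they do not reflect how either tool works internally), and the tag names are invented. How the OOV word's tag interacts with HMM decoding of its neighbors is not specified in the text and is not modeled here.

```python
def tag_sentence(tokens, vocabulary, hmm_tag, morph_tag):
    """Tag one sentence, taking the tag of OOV words from morphological analysis.

    vocabulary: set of lower-cased words seen in the training corpus.
    hmm_tag:    callable mapping a token list to a tag list (statistical tagger).
    morph_tag:  callable mapping one token to a tag (morphological analyzer
                output already converted to the target tagset).
    """
    statistical_tags = hmm_tag(tokens)
    tagged = []
    for token, tag in zip(tokens, statistical_tags):
        if token.lower() in vocabulary:
            tagged.append((token, tag))               # known word: keep HMM tag
        else:
            tagged.append((token, morph_tag(token)))  # OOV word: use analyzer
    return tagged


# Toy stand-ins, invented for this example only.
TOY_LEXICON = {"saya": "PRP", "makan": "VB", "nasi": "NN"}


def toy_hmm_tag(tokens):
    """Pretend statistical tagger: dictionary lookup with a dummy tag for OOV."""
    return [TOY_LEXICON.get(t.lower(), "X") for t in tokens]


def toy_morph_tag(token):
    """Pretend analyzer: treats a leading 'me' as a verbal prefix."""
    return "VB" if token.lower().startswith("me") else "NN"


if __name__ == "__main__":
    print(tag_sentence(["Saya", "memakan", "nasi"],
                       set(TOY_LEXICON), toy_hmm_tag, toy_morph_tag))
```

Passing the two taggers in as callables keeps the OOV decision separate from either tool, in line with the modular design of the sub-system.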
The end result of this Indonesian POS tagging sub-system is a sentence whose words have been assigned POS tags. Figure 4 shows the system input and output.

Building a System of Indonesian POS tagging for Self-Learning
After preparing the Indonesian POS tagging sub-system, we develop the Indonesian POS tagging system for language self-learning.
To develop the system effectively, we use Instructional Systems Design (ISD) with the ADDIE model. ADDIE is a design framework consisting of a sequence of generic processes used in instructional design and training development [13] [14]. The model has five phases: analysis, design, development, implementation and evaluation. These phases constitute the steps for building the Indonesian POS tagging system for language self-learning.

Analysis Phase
During this phase, we analyze the importance of developing a POS tagging system that provides part of speech information for language learners. The fairly complicated morphology of Indonesian is a challenge for learners who want to work out the correct language structure [15]. Morphological information is even more helpful for learners who speak a foreign language in their effort to understand the language structure, since it has a significant influence on translation. This POS tagging system is aimed at language learners who learn the language independently. Learners do not need specific skills; they only need to be accustomed to operating a computer. We have also prepared a clear guideline so that learners can install the system themselves, which makes it more user friendly. They only need to follow the guideline and can then immediately begin learning by themselves.
With the aid of this POS tagging system for Indonesian, learners who previously found it difficult to determine the POS tag of a word in a sentence are expected to be able to find that tag directly. A desktop application compatible with many operating systems enables learners to learn at any time and in any place, as long as they have a personal computer.

Design Phase
The target users of this POS tagging system are language learners who wish to use it for self-learning. Because these learners may come from many backgrounds and age ranges, we aim to make the system easy for anyone to operate.
The display is designed to be as simple as possible. The window directly shows the main interface, which contains a text input field and a main control button that runs the tagging process. The final output of the system is shown in a field directly below.
On the right-hand side, a button lets the user delete the contents of the input field and return the display to its default state. The system uses English as the default language for its dialogs. Figure 5 shows the environment of this Indonesian POS tagging system for independent language learning. The procedure for using the system is as follows: users enter the sentence to be tagged into the main input field and then press the button to start the tagging process. When tagging is complete, the system shows the tagged sentence in the output field. To tag another sentence, users only need to press the 'clear' button at the upper right corner, after which the system is ready for the next tagging process.
In terms of how it works, the system has two parts: a frontend and a backend. The frontend is the system interface, which acts as a bridge between users and the system. The backend is the Indonesian POS tagging sub-system.

Development Phase
In the development phase, we implement both the system design (backend) and the interface design (frontend) prepared during the previous phases. To facilitate this, we use a modular model, i.e. each system process is made into a module. Figure 6 shows the modules used in the Indonesian POS tagging system for self-learning. In the frontend, the module is the GUI module, which serves as the system interface. The interface is built with Java Swing, a Java toolkit for building GUIs. Once the interface is built, we test it to ensure that each part functions well.
In the backend, a module is made for each process of the Indonesian POS tagging sub-system. The Preprocessing module preprocesses the input received from the GUI, the Morphology analyzer module processes the output of the previous step using MorphInd, and the IPOSTagger module performs part of speech tagging using the HMM algorithm.
We then combine the frontend and backend modules into a single unit and test it to find out whether the system works as expected. Once the system passes this test, an executable file is built so that it can be easily installed on various systems.

Implementation Phase
In the implementation phase, we install the system on the device that will be used for self-learning. Several software and module dependencies must be satisfied before the system is installed.
The software required for the system to work properly is the Java Runtime Environment (JRE) and Foma. An operating system usually provides a default JRE version, so all that is required is to update it to the latest version. Foma needs to be downloaded before it can be installed.
Once the dependencies are installed, the executable file of the Indonesian POS tagging system for language self-learning can be installed on the device.

Evaluation Phase
Formative and summative evaluations are applied in the development of this Indonesian POS tagging system. The formative evaluation is carried out at each phase of the ADDIE model and aims to correct system errors and improve system quality. The summative evaluation is performed once the latest version of the system is obtained. Its main purpose is to determine the quality of service (QoS) and quality of experience (QoE).
In the summative evaluation, we apply objective and subjective measurements to the system. The subjective measurement uses the Mean Opinion Score (MOS) from respondents, in which respondents rate their opinion of a system's performance on a predetermined scale [16] [17].
The objective measurement is performed to determine the accuracy of the POS tagging system. We use a calibration approach, i.e. we compare the measured accuracy of our tagging model with that of other tagging models. The comparison is made with the original models of the IPOSTagger system, namely the basic models of HMM bigram, HMM trigram, Affix tree, and Lexicon. We use three corpora: the first contains 10% OOV words, the second 20% OOV words and the third 30% OOV words. Table 2 shows that the model combining HMM trigram with the Morphology Analyzer (MA) achieves higher accuracy than the other models tested, with an average of 92.25%.
The subjective measurement is carried out to determine the quality of the system as perceived by users who interact with it directly. We prepare several criteria and test them with 24 respondents, who are asked to assign a score to each criterion according to rules defined for that criterion. The final results of this evaluation determine the quality of the system from the users' point of view.
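For the objective measurement described above, tagging accuracy is conventionally the percentage of tokens whose predicted tag matches the reference tag, averaged here over the three test corpora. The sketch below assumes that definition; the function names are hypothetical and the numbers in the example are invented, not results from Table 2.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Percentage of positions where the predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy of an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return 100.0 * correct / len(gold_tags)


def average_accuracy(per_corpus_accuracies):
    """Unweighted mean of the accuracies measured on several corpora."""
    return sum(per_corpus_accuracies) / len(per_corpus_accuracies)


if __name__ == "__main__":
    # Invented example: one wrong tag out of four tokens.
    gold = ["PRP", "VB", "NN", "."]
    predicted = ["PRP", "VB", "JJ", "."]
    print(tagging_accuracy(gold, predicted))
    # Invented per-corpus values standing in for the 10%, 20% and 30% OOV corpora.
    print(average_accuracy([90.0, 80.0, 70.0]))
```

An unweighted mean treats the three corpora equally regardless of size; whether the reported average is weighted by corpus size is not stated in the text.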
The criteria tested in this evaluation are the design of the user interface menu, the user-friendliness of the system, and the properness of the system.
The design of the user interface menu (Test-1). We use the MOS scale to measure how attractive the user interface design is. The score scale is 1-5. Figure 7 shows that the average score from the participants is 3.61, which indicates that the user interface design is fairly attractive.
The user-friendliness of the system (Test-2). In this test, participants rate the system on a MOS scale of 1-5: 1: Bad (very hard to operate), 2: Poor (hard to operate), 3: Fair (easy enough to operate), 4: Good (easy to operate), and 5: Excellent (very easy to operate). The final result of the user-friendliness test is 3.87, indicating that the system is easy enough to operate.
Properness of the system (Test-3). This test examines whether the system is appropriate for self-learning. Participants rate the functional properness of the system on a MOS scale of 1-5: 1: Bad (highly improper), 2: Poor (improper), 3: Fair (proper enough), 4: Good (proper), and 5: Excellent (highly proper). Figure 7 shows the participants' average final score, 3.72, indicating that the system is proper enough for self-learning use.
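The MOS for each test is the arithmetic mean of the respondents' 1-5 ratings. The sketch below shows that computation with invented ratings (the individual responses are not given in the paper). The `band_label` helper, which maps a score to the label of the band it falls in by rounding down, is an assumption; it is consistent with the reading of 3.61, 3.87 and 3.72 as "Fair" above but is not stated as the authors' rule.

```python
import math
from statistics import mean

# Scale labels as given in the text.
MOS_LABELS = {1: "Bad", 2: "Poor", 3: "Fair", 4: "Good", 5: "Excellent"}


def mean_opinion_score(ratings):
    """Arithmetic mean of integer ratings on the 1-5 MOS scale."""
    if not ratings:
        raise ValueError("no ratings given")
    if any(r not in MOS_LABELS for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(ratings)


def band_label(score):
    """Label of the band a score falls in, rounding down (assumed rule)."""
    if not 1 <= score <= 5:
        raise ValueError("score must lie between 1 and 5")
    return MOS_LABELS[math.floor(score)]


if __name__ == "__main__":
    # Invented ratings from 24 hypothetical respondents.
    ratings = [4] * 15 + [3] * 9
    score = mean_opinion_score(ratings)
    print(score, band_label(score))
```

With 24 respondents, a single rating changes the mean by at most 4/24, about 0.17 points, which gives a sense of how sensitive scores in this range are to individual responses.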

Conclusion
This research shows that the Indonesian POS tagging system can be used to assist learners of Indonesian with independent learning. The system assigns a tag to each word in a sentence using a set of 31 POS tags.
During the evaluation phase, objective and subjective measurements are used. The objective measurement assesses the accuracy of the POS tagging method used by the system. Its results show that the combination of HMM trigram and Morphology Analyzer (MA) applied here achieves higher accuracy than the other methods tested. In the subjective measurement, three evaluation criteria are rated by participants, namely user interface design, user-friendliness of the system, and properness of the system. We expect that this system could serve as a word processing tool for POS tagging that helps learners acquire language skills through self-learning.