Using Text Mining

Educational technology services are growing rapidly, and the number of applications on the market is growing with them. Users may find it hard to choose the most suitable application, so they look for references. Experience shared as text reviews and numerical ratings can serve as such references. Text reviews are particularly specific, so they can provide insight into user satisfaction. In this study, we use text mining and a multicriteria decision-making approach to measure user satisfaction. The data are crawled from reviews of seven educational applications: Coursera, edX, Khan Academy, LinkedIn Learning, Quipper, Socratic and Udemy. Nine attributes, chosen according to a quality model of e-learning systems, are used to measure the user reviews. The results favor Khan Academy, while Quipper is ranked lowest. The v-values used range between 0 and 1; notably, the ranks of Khan Academy and Quipper are not affected by the v-value, while the ranks of the other applications are. This indicates that Khan Academy has high user satisfaction in terms of utility and low complaint from individuals. Quipper shows the opposite.


Introduction
in detail process steps therein such as data collection, data pre-processing, text mining, sentiment analysis and MCDM-VIKOR in Section 3. In Section 4, we present the results in tables and charts and then explain them. Finally, Section 5 concludes the paper.

Method
The proposed method is shown in Figure 1. The data were collected from the Google Play Store website, where reviews of seven educational service apps were crawled. The data were then pre-processed through a series of stages to make them ready for mining and analysis. The output of the mining and analysis stage is a set of keyword vectors. In the next stage, these vectors are turned into a decision matrix. In the final stage, the VIKOR approach, an MCDM method, is applied to the decision matrix.

Data collection
The apps were selected by considering similarities in their features. Common aspects were extracted from the review data and used to measure and compare user satisfaction. The reviews of Coursera, edX, Khan Academy, LinkedIn Learning, Quipper, Socratic, and Udemy were extracted from the Google Play website. For each application, 200 reviews were collected, so the data used in this study total 1,400 reviews. An example of an online review from each educational app can be seen in Table 1.

Data pre-processing
Data pre-processing transforms unformatted data into formatted data that can be processed in the next stages. It starts with data cleaning to remove irrelevant or noisy data. Noisy data are meaningless and cannot be interpreted because of entry errors or faulty collection. The second stage is data transformation, which includes tokenizing, case folding, stop-word filtering and stemming. The result of data pre-processing is a collection of stem words that are free from stop words and irrelevant characters or strings.
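The pre-processing stages described above can be sketched in a few lines of Python. This is only a minimal illustration, not the pipeline used in the study: the stop-word list is a tiny hypothetical sample, and the `naive_stem` function is a crude suffix stripper standing in for a real stemming algorithm, since the study does not name the one it used.

```python
import re

# Hypothetical, deliberately tiny stop-word list (the study's list is not given).
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "to", "of", "this", "it", "i"}

# Crude suffix stripping as a stand-in for a real stemmer (assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(review: str) -> list[str]:
    cleaned = re.sub(r"[^A-Za-z\s]", " ", review)         # data cleaning
    tokens = cleaned.split()                              # tokenizing
    tokens = [t.lower() for t in tokens]                  # case folding
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word filtering
    return [naive_stem(t) for t in tokens]                # stemming


stems = preprocess("The videos are AMAZING!!! I loved it 100%")
```

A production pipeline would normally replace `naive_stem` and `STOP_WORDS` with a tested stemmer and a full stop-word list for the language of the reviews.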

Text mining
The text-mining process in the current study consists of text segmentation, summary extraction, keyword identification, topic detection, term clustering and document categorization. The aim is to compile two dictionaries: a dictionary of attributes and a dictionary of sentiment words. This differs from the method proposed by [19], who treated attribute and sentiment words as pairs and used the pairing to determine polarities: if a sentiment word is labelled positive, then its paired word is labelled positive too, and vice versa. In the current study, each word is assigned its own label, independent of any pairing. Subsequently, the stem words from the data pre-processing were classified by part-of-speech (POS) tag. Ireland and Liu [20] distinguished several POS tags. The POS tagger used in this study is the Stanford POS tagger [21], which identifies adjectives, adverbs, nouns, numerals, and verbs. The dictionary of attributes is composed of noun phrases only, while the rest went to the dictionary of sentiment words.
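The split into two dictionaries can be illustrated as follows. The sketch assumes the tokens have already been tagged (for example by the Stanford POS tagger) and that the tags follow the Penn Treebank convention, in which noun tags start with `NN`; the function name `build_dictionaries` and the sample words are hypothetical.

```python
def build_dictionaries(tagged_tokens):
    """Split (word, POS tag) pairs into attribute and sentiment dictionaries.

    Nouns go to the attribute dictionary; every other tagged word goes to
    the sentiment dictionary, mirroring the rule described in the text.
    """
    attributes, sentiments = set(), set()
    for word, tag in tagged_tokens:
        if tag.startswith("NN"):      # NN, NNS, NNP, NNPS (assumed tag set)
            attributes.add(word)
        else:
            sentiments.add(word)
    return attributes, sentiments


# Hypothetical, pre-tagged sample tokens.
tagged = [("video", "NN"), ("great", "JJ"), ("courses", "NNS"), ("really", "RB")]
attribute_dict, sentiment_dict = build_dictionaries(tagged)
```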
Nine categories were selected to measure user satisfaction, as shown in Table 2. The categories are determined based on functionality, reliability, and usability [22]. After both dictionaries were developed, the words from the sentiment dictionary were classified into five polarity groups: strong positive, positive, neutral, negative, and strong negative. The spectrum of polarity usually consists of three types: positive, neutral, and negative. Dina [23] and Prastyo et al. [24] divided the polarity into three types. They applied sentiment analysis to determine the experience of hotel customers based on the most frequently mentioned words in reviews. In this study we used five types of polarity so that the result could be more specific and accurate. The polarity values range from -2 to 2: strong positive = 2; positive = 1; neutral = 0; negative = -1; strong negative = -2. If no sentiment word appeared, the polarity value was set to 0. The polarity value was used to construct keyword vectors of the user reviews. These were obtained by multiplying the attribute occurrences by the polarity values of the sentiment words in the review data.

MCDM-VIKOR
VIKOR stands for VIseKriterijumska Optimizacija I Kompromisno Resenje. The name is Serbian, and the method was first established by Opricovic [25] as one of the multicriteria decision-making (MCDM) approaches. It helps to solve a problem by considering multiple criteria [26]. It uses a multicriteria ranking index to compare how close each alternative is to the ideal, so that the best compromise alternative is obtained. The ranking index is obtained by calculating the maximum group utility ($S_i$) and minimum individual regret ($R_i$) [27]. VIKOR is carried out in the following steps.
Step 1: Establishing the decision matrix. The decision matrix is formed from the keyword vectors; it can simply be defined as the product of attribute occurrence and polarity value for each attribute. Then, the number of attribute instances is counted for each category. The structure of the decision matrix can be illustrated as follows:

$$
\begin{array}{c|ccc}
 & C_1 & \cdots & C_n \\ \hline
A_1 & SN_{11} & \cdots & SN_{1n} \\
\vdots & \vdots & \ddots & \vdots \\
A_m & SN_{m1} & \cdots & SN_{mn}
\end{array}
$$

where $A_i$ is the i-th alternative, $C_j$ is the j-th category, and $SN_{ij}$ is the value of the j-th aspect for the i-th alternative.
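One way to read Step 1 is sketched below. The text does not spell out how a sentiment word is matched to an attribute within a review, so this sketch assumes that the polarity of a review is the sum of the polarities of its sentiment words (0 if there are none) and that each attribute occurrence in the review is multiplied by that value. The word lists are hypothetical; only the category labels C3 (video), C6 (content) and C7 (course) come from the paper.

```python
from collections import Counter

# Hypothetical sentiment dictionary using the five polarity levels (-2..2).
POLARITY = {"excellent": 2, "good": 1, "okay": 0, "bad": -1, "terrible": -2}

# Hypothetical attribute-to-category mapping.
CATEGORY_OF = {"video": "C3", "content": "C6", "course": "C7"}


def review_vector(tokens):
    """Keyword vector of one review: attribute occurrences x review polarity."""
    polarity = sum(POLARITY.get(t, 0) for t in tokens)   # 0 if no sentiment word
    counts = Counter(CATEGORY_OF[t] for t in tokens if t in CATEGORY_OF)
    return {cat: n * polarity for cat, n in counts.items()}


def decision_row(reviews):
    """Sum the keyword vectors of all reviews of one app (one matrix row)."""
    row = Counter()
    for tokens in reviews:
        row.update(review_vector(tokens))   # adds values, negatives included
    return dict(row)


reviews_of_one_app = [["video", "excellent"], ["course", "bad", "course"], ["content"]]
row = decision_row(reviews_of_one_app)
```

Stacking one such row per app gives a matrix with the same shape as the decision matrix in Step 1.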
Step 2: Calculating the normalized values and putting them into a decision matrix. The attribute occurrence (x) of each category from Step 1 is normalized to a value between 0 and 1. The formulation for normalization is shown in (1). A weighting scheme was applied to the data to measure the weight of each category ($w_k$), which is calculated using (2).
Step 3: Calculating the best $f_j^*$ and worst $f_j^-$ values of all criteria. The positive ($f_j^*$) and negative ($f_j^-$) ideal solutions are obtained using (3):

$$f_j^* = \max_i f_{ij}, \qquad f_j^- = \min_i f_{ij} \tag{3}$$

Step 4: Calculating the new decision matrix with the weight ($w_j$). The formula to assign the weight ($w_j$) for a new decision matrix is written in (4).
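Steps 2 to 4 can be sketched as below. Equations (1), (2) and (4) are not reproduced in this text, so the sketch rests on assumptions: min-max scaling for the normalization, each category's share of all term instances for the weight, and element-wise multiplication for the weighted matrix (the last of these matches the description of Table 6). All function names are hypothetical.

```python
def min_max_normalize(column):
    """Scale one category's scores across all alternatives to [0, 1] (assumed form of (1))."""
    lo, hi = min(column), max(column)
    if hi == lo:                       # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def occurrence_weights(term_counts):
    """Weight of each category as its share of all term instances (assumed form of (2))."""
    total = sum(term_counts)
    return [c / total for c in term_counts]


def ideal_solutions(matrix):
    """Best (column maximum) and worst (column minimum) values, as in (3)."""
    columns = list(zip(*matrix))
    return [max(c) for c in columns], [min(c) for c in columns]


def apply_weights(matrix, weights):
    """Multiply every normalized value by the weight of its category (Step 4)."""
    return [[value * w for value, w in zip(row, weights)] for row in matrix]
```

For example, `min_max_normalize([-2, 0, 2])` gives `[0.0, 0.5, 1.0]`, which shows how negative raw scores still end up in the 0 to 1 range.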
Step 5: Calculating the values of the group utility ($S_i$) and individual regret ($R_i$). The utility measure ($S_i$) and regret measure ($R_i$) are calculated using (5) and (6), where $w_j$ is the weight of the category.
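For orientation, the utility and regret measures in the commonly cited formulation of VIKOR are written below, using the symbols already defined in Steps 3 and 4. The paper's own equations (5) and (6) are not reproduced in this text and may differ in detail, so this should be read as the textbook form rather than as the authors' exact expressions.

```latex
S_i = \sum_{j=1}^{n} w_j \, \frac{f_j^* - f_{ij}}{f_j^* - f_j^-},
\qquad
R_i = \max_{j} \left[ w_j \, \frac{f_j^* - f_{ij}}{f_j^* - f_j^-} \right]
```

In this form $S_i$ adds up the weighted distances of alternative $i$ from the best value over all criteria, while $R_i$ keeps only the largest single weighted distance.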
Step 6: Calculating the index value ($Q_i$) by using (7), where $S^*$ = maximum value of $S_i$; $S^-$ = minimum value of $S_i$; $R^*$ = maximum value of $R_i$; $R^-$ = minimum value of $R_i$; v = index weight value.
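A minimal sketch of the index calculation follows. It uses the commonly cited form of the VIKOR index, in which both measures are rescaled to the 0 to 1 range and combined with the weight v; equation (7) itself is not reproduced in this text. The code uses explicit minimum and maximum values rather than starred symbols, because the meaning attached to those symbols varies between presentations. The function name `vikor_q` and the default `v=0.5` are assumptions.

```python
def vikor_q(S, R, v=0.5):
    """Compromise index Q_i for each alternative (lower is better).

    S: utility measures, R: regret measures, v: weight given to group utility.
    """
    s_min, s_max = min(S), max(S)
    r_min, r_max = min(R), max(R)

    def scaled(x, lo, hi):
        # Rescale to [0, 1]; a constant list contributes nothing to Q.
        return 0.0 if hi == lo else (x - lo) / (hi - lo)

    return [
        v * scaled(s, s_min, s_max) + (1 - v) * scaled(r, r_min, r_max)
        for s, r in zip(S, R)
    ]


# Hypothetical values for three alternatives, not taken from the paper.
q_values = vikor_q([0.25, 0.5, 0.75], [0.0, 0.25, 0.5], v=0.5)
```

With v = 1 the index depends only on the utility measure, and with v = 0 only on the regret measure, which is what makes v useful for a sensitivity analysis.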
Step 7: Ranking the order of preference by index value ($Q_i$). The smaller the index value, the better the solution.

Results and discussion
Table 3 displays the calculated score for all the categories ($C_n$) as described in Table 2. The scores can be positive or negative. A negative score means that negative polarities outnumber positive polarities in that category. The last row shows the number of term instances in each category. Table 4 is generated from Table 3: the normalized scores are obtained by applying (1), while the weights are calculated from the last row by using (2). Equation (3) was applied to determine the positive ideal solution ($f_j^*$) and negative ideal solution ($f_j^-$) by comparing the normalized values in Table 4: the maximum value is taken as the positive ideal solution and the minimum value as the negative ideal solution. Both values are presented in Table 5. Next, Table 6 shows the new decision matrix obtained by multiplying the normalized values by the weights.
The index value ($Q_i$) determines the rank of each app based on (7). The lowest index value indicates the highest user satisfaction across the criteria. The $Q_i$ value involves the utility measure ($S_i$) and regret measure ($R_i$), which are calculated using (5) and (6). From Table 6, and with reference to the $R_i$ values in Table 7, the criteria that most need improvement are course (C7), video (C3) and content (C6). Based on the results in Table 7, a sensitivity analysis was conducted to see how the ranking of the seven educational apps changes with the v-value, which ranges from 0 to 1. Figure 4 shows that the rankings of Khan Academy and Quipper are not affected by the v-value: Khan Academy was ranked first and Quipper last. This means that Khan Academy has high user satisfaction in terms of both maximum group utility and minimum individual regret, while Quipper is the opposite. Meanwhile, the ranks of Coursera and Udemy are lower when the v-value is increased. If Coursera and Udemy focus on minimum individual regret, both apps will have higher user satisfaction. On the other hand, the ranks of LinkedIn Learning and Socratic are higher when the v-value is increased. This means that the user satisfaction of LinkedIn Learning and Socratic will be higher if the maximum group utility is increased.

Fig. 4. Sensitivity Analysis
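The kind of sensitivity analysis summarized in Fig. 4 can be sketched by recomputing the ranking for a grid of v-values. The four alternatives and their utility and regret values below are toy numbers chosen for illustration, not the study's data, and the index uses the commonly cited VIKOR form rather than the paper's equation (7), which is not reproduced in this text.

```python
def rank_for_v(names, S, R, v):
    """Return alternative names ordered from best (lowest Q) to worst."""
    def scaled(x, values):
        lo, hi = min(values), max(values)
        return 0.0 if hi == lo else (x - lo) / (hi - lo)

    q = [v * scaled(s, S) + (1 - v) * scaled(r, R) for s, r in zip(S, R)]
    order = sorted(range(len(names)), key=lambda i: q[i])   # stable sort on Q
    return [names[i] for i in order]


def sensitivity(names, S, R, steps=11):
    """Ranking for v = 0.0, 0.1, ..., 1.0 (with the default number of steps)."""
    return {
        round(i / (steps - 1), 2): rank_for_v(names, S, R, i / (steps - 1))
        for i in range(steps)
    }


# Toy example: A is best and D is worst on both measures; B and C trade off.
names = ["A", "B", "C", "D"]
S = [0.0, 0.25, 0.75, 1.0]   # hypothetical utility measures
R = [0.0, 0.75, 0.25, 1.0]   # hypothetical regret measures
rankings = sensitivity(names, S, R)
```

In this toy case the first and last positions stay fixed for every v, while the two middle alternatives swap places as v moves from 0 to 1, which is the same qualitative pattern the text describes for the top-ranked and bottom-ranked apps.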
However, this study has several limitations. First, only seven apps are covered as case studies; increasing the number of apps would improve the robustness of the study. Second, only 1,400 reviews were crawled from the website. Considering the high volume of reviews online, the number of text reviews could be increased in order to capture trends more accurately. In addition, future research would benefit from combining star ratings and text reviews to generate a more holistic measurement system. The categories selected in the current study are based on ISO 9126 as the quality model for e-learning systems; they could be compared with other quality models as well. Last, VIKOR is the only MCDM approach used in this study. A comparative analysis involving other MCDM methods is needed to corroborate the findings.

Conclusion
This study proposes a framework for measuring user satisfaction with educational apps based on user reviews. The data used in this study are actual user reviews. Text mining is used to extract lexical attributes from the reviews, and sentiment analysis is conducted to determine word polarity. After text mining and sentiment analysis, the keyword vectors were evaluated using an MCDM algorithm, namely VIKOR. The weight of each criterion in VIKOR is obtained from the term occurrences, which helps to reduce subjectivity in the evaluation. Of the seven educational apps, Khan Academy shows the highest satisfaction and Quipper the lowest. The results show that user satisfaction is also influenced by maximum group utility and minimum individual regret. Future studies should use another MCDM method to validate the result. The amount of crawled data should be increased to improve the dictionaries of attribute and sentiment words. The set of categories should also be extended beyond the nine used in this study.