Utilizing Text Mining and Feature-Sentiment-Pairs to Support Data-Driven Design Automation Massive Open Online Course

This study aimed to develop a case-based design framework that analyses online user reviews to understand user preferences in the content-related design of a Massive Open Online Course (MOOC). Another purpose was to identify future trends in MOOC content-related design. Thus, it was an effort to achieve data-driven design automation. This research uses text mining to extract pairs of keywords, called Feature-Sentiment-Pairs (FSPs), to identify user preferences. The user preferences were then used as features of a MOOC content-related design. A MOOC case study is used to implement the proposed framework, with online reviews collected from www.coursera.org. The framework treats these large-scale online review data as qualitative data and converts them into meaningful quantitative information, especially on content-related design, so that the MOOC designer can decide on better content based on the data. The framework combines online reviews, text mining, and data analytics to reveal new information about users’ preferences in MOOC content-related design. This study applied text mining, and specifically FSPs, to identify user preferences in MOOC content-related design. The framework can help avoid unwanted features in MOOC content-related design and speed up the identification of user preferences.


Introduction
A Massive Open Online Course, often abbreviated to MOOC, is an online course held by a university with the option of free or open registration. The teachers are faculty members of the university or general practitioners in their fields. Classes are conducted through weekly video lectures, online assessments, and discussion forums. However, some educators assume that the quality of learning in a MOOC differs from that of a face-to-face class because it cannot replace face-to-face classroom engagement, laboratory work, fieldwork, and other aspects [1][2][3].
The purpose of Data-Driven Design is to design a system based on the provided data. The pattern of the available data is discovered using data mining algorithms [4]. The result will be used as a new design system. According to [5], the challenges of Data-Driven Design usage as a product evaluation process are data structure, data understanding, incomplete information, nature of the data, and individual cognitive limitation. The data are often unstructured and heterogeneous; thus, it is mandatory to process the data into information. Hence, the main challenge of data-driven design is to transform the large-scale unstructured data into meaningful information so that people are able to utilize the knowledge.
Text mining, in the context of this study, transforms unstructured text into vectors of numbers. The unstructured text is the qualitative data, and the vectors of numbers are the quantitative data. Those vectors are later used as input to a process in the framework that involves a data mining algorithm to predict classification results [6]. Many fields have applied text mining to support their business processes, for example, knowledge management, customer care service, social media data analysis, and spam filtering. In 2020, [7] applied text mining in the tourism sector to find the most frequent words describing negative and positive experiences of staying in a particular hotel, which helped the hotel's management improve its customer care services. Meanwhile, in 2019, [8] proposed a novel approach that uses a semantic-based Naive Bayesian classifier in text analytics to filter spam emails. Text mining still has potential to be explored in further studies and can be applied directly to solve practical problems in many fields.
This study aimed to help designers of the MOOC learning platform understand users' needs without having to browse the online reviews every time. The users here are the students who use a MOOC as a learning platform. The MOOC learning platform we used was Coursera. First, the users' online reviews were extracted from the review section of Coursera. Then, a design automation system was built to analyze the reviews and obtain the features that appear most often in the review data. Text mining was used to acquire keywords, which were then paired as Feature-Sentiment-Pairs (FSPs). Finally, the result of the system, referred to as the machine model, was compared with a human-generated result, called the human model. The purpose of the comparison was to find out the machine model's performance. The human model was used as a control variable.
The article is organized as follows: Section 2 explains the method used in the proposed framework. Section 3 presents the results, followed by the discussion in Section 4. Finally, Section 5 concludes this research.

Methods
The proposed framework is illustrated in Fig. 1. The framework aims to find out the difference between the machine model and the human model. There are four steps to generate FSP automatically.

Collecting the data
The data were collected from the Coursera website using the WebHarvy web scraper [9]. An example of a Coursera review page can be seen in Fig. 2. WebHarvy needs the URL of each course's online review page, where the original dataset is located. The user selects the attributes to be crawled. In this research, we crawled "online review", "course's name" and "star review". Table 1 gives a sample of the crawled data. Besides these attributes, the page offered other attributes such as "username" and "date". However, we excluded them since we did not categorize the online reviews based on those attributes.
Fig. 3
In total, 9677 online reviews were crawled. To balance the data based on review sentiment, they were divided into two types, namely negative and positive sentiment. The sentiment was used as the label of each review. The label was assigned manually based on the star rating of each review: a 1-2-star review was categorized as negative sentiment, while a 4-5-star review was categorized as positive sentiment. The categorization was based on [10]. Table 2 shows the total number of data.

Data pre-processing
Data pre-processing is the earliest step, in which the data are transformed into features that the algorithms can interpret easily. Data pre-processing in text mining includes four basic sub-processes:

1) Tokenize
2) Stop words removal
3) Stemming
4) Transform cases
Tokenize is the process of splitting a sentence into individual words, so that each review record is divided into a number of words. Stop words removal removes unimportant words, including prepositions, articles, and pronouns, as well as numbers and punctuation. Stemming reduces a word to its basic form by deleting suffixes and unnecessary characters. Finally, transform cases converts all the characters into upper case or lower case.
The pre-processing step in this research was done using RapidMiner. RapidMiner is open-source software for data mining, and it is popular among data scientists for its ease of use. Several text mining studies use RapidMiner, such as those conducted by [11] [12]. Even though they used data taken from different websites with different data structures, such as Glassdoor and Twitter, RapidMiner could process the data well. An example of the data pre-processing step in RapidMiner is illustrated in Fig. 4.
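The four sub-processes can be illustrated outside RapidMiner with a minimal Python sketch. It is not the RapidMiner process used in this study: the stop-word list, the suffix list, and the function names are assumptions made for illustration, and the naive suffix stripper stands in for a real stemming algorithm. Lower-casing is applied before stop words removal so that the stop-word lookup matches regardless of case.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "was", "were", "it", "this", "that",
              "of", "in", "on", "to", "and", "i", "we", "they", "very"}

# Suffixes removed by the naive stemmer, tried in this order (not Porter's algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split a review into alphabetic word tokens, dropping digits and punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def naive_stem(word):
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, transform to lower case, remove stop words, then stem."""
    tokens = [t.lower() for t in tokenize(text)]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


print(preprocess("The lectures were very interesting, and the teacher explained 3 topics!"))
```

A production pipeline would replace `naive_stem` with an established stemmer and use a complete stop-word list.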

Machine model of sentiment analysis
There are three phases to complete this machine model:

1) Tag words types
2) Determine FSPs
3) Train the model
The objective of this machine model is to form Feature-Sentiment-Pairs (FSPs) from Noun, Verb, Adjective, and Adverb.

Tag word types via Part-of-Speech (POS) Tagger
The POS tagger used in this research was the Stanford POS Tagger, following [13]. The Stanford POS Tagger is software that analyzes English text and determines the part of speech of each word. An advantage of the Stanford POS Tagger is that it can be used from both Java and Python. Table 3 shows the assumed POS tags according to [14]. The POS tags were divided into nouns, verbs, adjectives, and adverbs. Other parts of speech, such as prepositions, articles, and pronouns, had already been eliminated during the stop words removal sub-process of the pre-processing phase [15].
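The grouping of tags into four word types can be sketched as follows. The sketch assumes the tagger emits Penn Treebank tags (for example `NN`, `NNS`, `VBD`, `JJ`, `RB`) and groups them by their two-letter prefix; this grouping is an assumption and may differ in detail from Table 3. The tagged input is hypothetical and written by hand rather than produced by a tagger.

```python
# Map Penn Treebank tag prefixes to the four word types used for FSPs.
PREFIX_TO_TYPE = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}


def coarse_type(tag):
    """Return the coarse word type for a Penn Treebank tag, or None if unused."""
    return PREFIX_TO_TYPE.get(tag[:2])


def group_tagged_tokens(tagged_tokens):
    """Keep (word, coarse type) pairs for nouns, verbs, adjectives, and adverbs."""
    grouped = []
    for word, tag in tagged_tokens:
        word_type = coarse_type(tag)
        if word_type is not None:
            grouped.append((word, word_type))
    return grouped


# Hypothetical tagger output for a short review.
tagged = [("great", "JJ"), ("professors", "NNS"), ("the", "DT")]
print(group_tagged_tokens(tagged))
```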

Generate feature-sentiment-pairs from sentences
Feature-Sentiment-Pairs (FSPs) consist of a feature and a sentiment. Features were taken from nouns, and sentiments were taken from verbs, adjectives, and adverbs. They were paired after the data had been labeled as positive or negative sentiment. If two nouns appear in the same review, both nouns are paired with all the sentiment words [14]. Lastly, the FSPs were counted based on their appearance in the review data. The FSPs were also divided according to their label.
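The pairing and counting rule described above can be sketched in a few lines. The function names, the `feature-sentiment` string format, and the sample reviews are assumptions made for illustration; the input is assumed to be already pre-processed and tagged with coarse word types.

```python
from collections import Counter
from itertools import product

SENTIMENT_TYPES = {"verb", "adjective", "adverb"}


def extract_fsps(typed_tokens):
    """Pair every noun in one review with every sentiment word in that review."""
    features = [w for w, t in typed_tokens if t == "noun"]
    sentiments = [w for w, t in typed_tokens if t in SENTIMENT_TYPES]
    return [f"{f}-{s}" for f, s in product(features, sentiments)]


def count_fsps(labelled_reviews):
    """Count FSP occurrences separately for each sentiment label."""
    counts = {}
    for label, typed_tokens in labelled_reviews:
        counts.setdefault(label, Counter()).update(extract_fsps(typed_tokens))
    return counts


# Invented reviews: (label, [(word, coarse word type), ...]).
reviews = [
    ("positive", [("course", "noun"), ("good", "adjective")]),
    ("positive", [("course", "noun"), ("material", "noun"), ("good", "adjective")]),
    ("negative", [("course", "noun"), ("poor", "adjective")]),
]
fsp_counts = count_fsps(reviews)
print(fsp_counts["positive"].most_common())
print(fsp_counts["negative"].most_common())
```

Because every noun is paired with every sentiment word, a long review with several nouns can produce pairs that the reviewer never intended, which is one source of noise in the counts.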

Train the model to predict sentiment
The Support Vector Machine (SVM) algorithm is a supervised classification algorithm that is considered the best text classification method [16]. The data were divided into several elements, and each element was placed in n-dimensional space as a vector coordinate [17]. A hyperplane is formed to distinguish the classes. Fig. 5 illustrates that more than one hyperplane can separate the classes, but only one is optimal. The optimal hyperplane is drawn as the red line. It is optimal because it separates the classes with the largest margin, that is, it maximizes the distance to the closest points of each class [18]. The hyperplane can be presented algebraically using Eq. 1 [17].
Eq. 2 describes the hyperplane during training, where x is the closest point to the hyperplane. The points closest to the hyperplane are referred to as support vectors. The distance between a point xi and the hyperplane (ri) is calculated using Eq. 3. The formula for the margin (M) is shown in Eq. 4.
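For orientation, a standard textbook form of the linear SVM quantities discussed above can be written as follows. The symbols w (weight vector), b (bias), and y_i (class label in {-1, +1}) are assumptions of this sketch, and the notation may not match Eq. 1 to Eq. 4 exactly.

```latex
% Separating hyperplane
w \cdot x + b = 0

% Training constraint; equality holds for the support vectors
y_i \,(w \cdot x_i + b) \ge 1

% Distance between a point x_i and the hyperplane
r_i = \frac{y_i \,(w \cdot x_i + b)}{\lVert w \rVert}

% Margin between the two classes
M = \frac{2}{\lVert w \rVert}
```

Under this formulation, maximizing the margin M is equivalent to minimizing the norm of w subject to the training constraint.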
It should be noted that SVM needs more training data to achieve higher accuracy [19]. SVM performs better than other classification algorithms when multiple classifications in a large dataset are involved. The pseudo-code of SVM is shown in Table 4.
In general, text classification using SVM proceeds through term weighting, training, and testing. Term weighting uses the number of occurrences of terms in the documents and is often called feature extraction. Feature extraction transforms a text document of any format into a list of features/terms that can be easily processed by text classification techniques. It is one of the significant pre-processing techniques in text classification because it computes the values of the features/terms in the documents. There are several methods of term weighting/feature extraction, such as Term Frequency (TF), Inverse Document Frequency (IDF), and the combination of TF and IDF, which is called TF.IDF.
Term Frequency (TF) is the frequency of occurrence of a term (t) in a document (di), as shown in Table 5. For example, the term "Increase" appears in four documents, and each document has a different TF. Meanwhile, Document Frequency (DF) is the number of documents in which a term (t) appears. DF was calculated from Table 5, as illustrated in Table 6. After the DF value is obtained, the IDF value is calculated using Eq. 5. TF.IDF is obtained by multiplying TF and IDF, as seen in Eq. 6. The results of IDF and TF.IDF are shown in Table 6. The result of the weighting is a set of feature vectors that reflect the presence of each word in a document.
Furthermore, a feature vector space was formed. The feature vector space was obtained by counting the number of terms present in the entire document collection.
If there are 5 different terms in the entire document collection, as shown in Table 5, then the feature vector has 5 dimensions, where each component of the vector represents one term. The vector dimensions can be reduced by deleting insignificant words. After the vector space is formed, the vector representing each document is built using one of the weighting methods. In this study, TF weighting was chosen as the weighting method. If the first document contains the term "Poor" once, the term "Short" twice, and the term "Teacher" five times, then the feature vector for the first document has x2 = 1, x3 = 2, and x5 = 5. After the feature vectors are formed, complete with their respective labels, they are ready to be fed into the SVM as training data [20]. The output of the SVM training process is the best hyperplane, which is used as the classifier. The formula of the hyperplane can be seen in Eq. 2. At testing time, the feature vectors used as testing data are entered into the SVM without their labels. The output of the test is a predicted class label. The class labels in this study determine whether the sentiment of a review is positive or negative. To check the accuracy of the classifier, the output label is then compared with the original label, as described in the subsection on evaluating classification performance.
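The term-weighting quantities described above can be sketched as follows. The IDF is assumed here to take the common form log10(N / DF), where N is the number of documents; the exact form of Eq. 5 may differ. The four short documents are invented and do not reproduce Table 5.

```python
import math
from collections import Counter


def term_frequencies(documents):
    """TF: number of occurrences of each term in each document."""
    return [Counter(doc) for doc in documents]


def document_frequencies(documents):
    """DF: number of documents in which each term appears at least once."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    return df


def inverse_document_frequencies(documents):
    """IDF, assumed here to be log10(N / DF) with N the number of documents."""
    n_docs = len(documents)
    df = document_frequencies(documents)
    return {term: math.log10(n_docs / count) for term, count in df.items()}


def tf_idf(documents):
    """TF.IDF: product of TF and IDF for every term of every document."""
    idf = inverse_document_frequencies(documents)
    return [{term: tf * idf[term] for term, tf in counts.items()}
            for counts in term_frequencies(documents)]


# Invented, already pre-processed documents.
docs = [
    ["teacher", "poor", "teacher"],
    ["teacher", "good"],
    ["course", "good"],
    ["course", "short"],
]
weights = tf_idf(docs)
print(document_frequencies(docs))
print(weights[0])
```

A term that appears in every document gets an IDF of zero under this definition, so it contributes nothing to the TF.IDF weight.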
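The steps from TF feature vectors to a trained classifier can be sketched with a small linear SVM. This is a minimal illustration, not the implementation used in this study: the five-term vocabulary (only "Increase", "Poor", "Short", and "Teacher" come from the text; "good" is hypothetical), the training reviews, and the hyperparameters are assumptions, and the model is fitted by sub-gradient descent on the L2-regularized hinge loss rather than by a library solver.

```python
# Hypothetical five-term vocabulary; position k corresponds to x_k in the text.
VOCABULARY = ["increase", "poor", "short", "good", "teacher"]


def tf_vector(tokens, vocabulary):
    """TF weighting: count how often each vocabulary term occurs in the tokens."""
    return [tokens.count(term) for term in vocabulary]


def score(w, b, x):
    """Signed value of w . x + b for one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(vectors, labels, epochs=200, learning_rate=0.01, reg=0.01):
    """Sub-gradient descent on the L2-regularized hinge loss; labels are +1 or -1."""
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            violated = y * score(w, b, x) < 1
            # Regularization term: shrink the weights slightly at every step.
            w = [wi * (1 - learning_rate * reg) for wi in w]
            if violated:
                # Hinge term: move the hyperplane towards classifying x correctly.
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * y
    return w, b


def predict(w, b, x):
    """Return +1 (positive sentiment) or -1 (negative sentiment)."""
    return 1 if score(w, b, x) >= 0 else -1


# Invented, pre-processed training reviews with labels (+1 positive, -1 negative).
first_doc = ["poor"] + ["short"] * 2 + ["teacher"] * 5
training_docs = [first_doc, ["good", "teacher"], ["poor", "teacher"],
                 ["good", "good", "increase"]]
training_labels = [-1, 1, -1, 1]

vectors = [tf_vector(doc, VOCABULARY) for doc in training_docs]
w, b = train_linear_svm(vectors, training_labels)

print(vectors[0])
print(predict(w, b, tf_vector(["good", "teacher", "increase"], VOCABULARY)))
print(predict(w, b, tf_vector(["poor", "short"], VOCABULARY)))
```

The first training document reproduces the example above: its vector has x2 = 1, x3 = 2, and x5 = 5, and all other components are zero.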

Evaluate classification performance
The classification performance is evaluated using a 2x2 confusion matrix according to [21]. The accuracy, precision, recall, and F1 score were calculated according to Table 7 and Eq. 7-10. The confusion matrix consists of the actual class and the predicted class. Accuracy is the number of true positives and true negatives divided by the total number of observations. The total number of observations is the sum of all the numbers in the confusion matrix, as seen in Table 7. Meanwhile, precision is the ratio of true positives to the total number of true positives and false positives. Recall is the ratio of true positives to the total number of true positives and false negatives. The F1 score is the harmonic mean of precision and recall.
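The four measures can be computed directly from the cells of the confusion matrix, as in the sketch below. The counts are hypothetical and are not the results of this study; the function assumes that the denominators are non-zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: true positives, false positives, false negatives, true negatives.
metrics = classification_metrics(tp=80, fp=20, fn=10, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```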

Human model of sentiment analysis
In this research, we asked annotators to assist in modelling the sentiment analysis. The annotators labeled the review data and determined the FSPs. The results of the annotators' analysis were used as a control variable.
Interpret data: Annotators examined the sentiment categorization independently of the star-rating generalization of the crawled review data. In this phase, the annotators checked whether the sentiment label was in accordance with the review text.
Infer patterns from data: Unlike the machine model, annotators could recognize synonyms and sarcasm. For example, there were two reviews: "great professors" and "good teacher". The annotators knew that both reviews expressed the same thing, since "professor" means the "teacher" who taught the course. The annotators therefore formulated the FSPs more easily than the machine model.
Measure the frequency of feature-sentiment-pairs: The frequency of every FSP in the reviews was counted. The easiest way to find out the significance of an FSP was to count its appearances in the reviews. The output of the annotators' model was the set of FSPs with the highest frequency.

Sentiment classification result
The model was trained using the SVM algorithm. The review data were divided into two groups: training data and testing data. For evaluation purposes, accuracy, recall and precision were calculated. The result is shown in Table 8. Ratio means the ratio of training data to testing data. In the first experiment, the training data and testing data each made up 50% of the total data. In the second experiment, the training data made up 80% and the testing data only 20%. The evaluation results differed, and the best result was obtained with the 80:20 ratio. With 9677 reviews in total, the 80:20 ratio gives 7741 training data and 1936 testing data. The result of the machine model is illustrated in Fig. 6 and Fig. 7. The y-axis of both charts shows the frequency of negative and positive FSPs. The x-axis represents the FSPs. The most frequent negative FSPs in Fig. 6 indicate that the course was "basic", "poor", "difficult", and "bored". Fig. 7 shows that the most frequent positive FSPs are "great", "good", "understand", and "interesting".
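The split sizes quoted above follow from integer arithmetic, as the sketch below shows. How the reviews were assigned to each group is not stated, so the seeded shuffle and the function names are assumptions made for illustration.

```python
import random


def split_sizes(total, train_percent):
    """Return (training size, testing size), rounding the training size down."""
    n_train = total * train_percent // 100
    return n_train, total - n_train


def train_test_split(items, train_percent, seed=0):
    """Shuffle a copy of the items with a fixed seed and cut it at the training size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train, _ = split_sizes(len(shuffled), train_percent)
    return shuffled[:n_train], shuffled[n_train:]


print(split_sizes(9677, 80))
print(split_sizes(9677, 50))

# Review indices stand in for the labelled reviews.
train, test = train_test_split(range(9677), 80)
print(len(train), len(test))
```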

Result of machine model sentiment analysis
As displayed in Fig. 6, the users thought that the course was too basic compared with what they expected. They also thought that the course was poor and boring, and they faced difficulties while following the course. The positive features can be seen in Fig. 7. The users felt that the course was great and good. Since both words are synonyms, this implies that the course was good enough. They found the course understandable and interesting to follow. The positive and negative features oppose each other, so the results are compared by their frequencies. For example, the word "good" is the antonym of "poor", but the frequency of occurrence of "good" is higher than that of "poor". This also applies to the pairs "understand" and "difficult", and "interesting" and "bored".

Result of human model sentiment analysis
The result of the human model is illustrated in Fig. 8 and Fig. 9. The y-axis of both charts shows the frequency of negative and positive FSPs. The x-axis represents the FSPs. The most frequent negative FSPs in Fig. 8 indicate that the course offered only a few materials and that they were difficult to comprehend. Fig. 9 shows that the most frequent positive FSPs are that the course was good and interesting, and that the introduction was great.
As displayed in Fig. 8, the users thought that only a few materials were provided in the course, which did not meet their expectations. They also thought that the course was bad and its material was difficult. The positive features can be seen in Fig. 9. The users felt that the course was good and interesting. They found that the introduction of each course was great. The positive and negative features oppose each other, so the results are compared by their frequencies. For example, the word "good" is the antonym of "bad", but the frequency of occurrence of "good" is higher than that of "bad".

Discussion
The Machine Model and the Human Model were compared to measure the success of the framework. The Human Model was also referred to as the ground truth because its result came from expert analysis. The success of the framework was indicated by the similarity between the Human Model and the Machine Model. The positive features produced by the Machine Model were "course-good", "course-interesting", "course-easy", "course-understand", "course-recommended", and "material-good".
Those positive features were also the most frequently occurring FSPs in the Human Model. Even though the count of each positive feature, such as "course-recommended", in the Machine Model is not exactly the same as in the Human Model, the frequent FSPs in the two models are mostly the same. As illustrated in Fig. 7 and Fig. 9, most of the positive FSPs in the Machine Model appear in the Human Model. Similarly, for the negative features, the most frequently occurring FSPs produced by the Machine Model are nearly the same as those produced by the Human Model. Further study needs to be carried out to improve the accuracy of the FSP counts in both the Machine Model and the Human Model.
The agreement between the FSPs from the Machine Model and those from the Human Model suggests that the framework managed to transform the unstructured data into meaningful information. Since the data were used as the main source for decision making in this framework, the approach used was data-driven design. The data-driven design made decisions regarding the development of content-related design and system design fully based on the data collected from the MOOC, specifically the data about how MOOC users interact with the system as seen on MOOC discussion forum.
Similar studies have used data from MOOC discussion forums. For example, Wise et al. developed a linguistic model in 2016 to identify whether posts in a MOOC discussion forum are substantially related to the course content by searching for predefined keywords [22]. It helps instructors identify content-related questions from learners, so that the instructors can improve their course material. However, the predefined keywords they proposed applied only to a certain course. Thus, if the course changed, the keywords had to be changed accordingly. Brinton et al. used large-scale statistical analysis of discussion forums in Coursera [23]. They investigated user behaviour on the discussion forums and looked for the most course-relevant discussions. They ranked the discussion topics based on their relevance to the course using a unified generative model. Gamage et al. [24] also studied user behaviour, as done by [23]. The difference was the method used.
[24] applied an ethnographic method using in-depth interviews with two groups of participants. The first group had never used a MOOC, and the other group had. From the in-depth interviews, the researchers derived recommendations for MOOC design. Agrawal et al. made the Stanford MOOCPost Dataset, which was tagged manually according to six dimensions, including confusion, question, answer, opinion, sentiment, and urgency, to address confusion in the MOOC discussion forum. They also recommended instructional video clips for the user to open [25]. They classified the confusion by using "tracking log data" from logs of learners' actions [25].
Although the dataset in this study was limited to fewer than 10000 online reviews, the Machine Model was able to produce information to support the decision-making process in MOOC. The data were divided into an equal number of positive and negative labels, as shown in Table 2. The number of data for each label should be equal to avoid bias towards the dominant label. The accuracy, precision, recall, and F1 score were above 80%. These values were high enough even though the size of the dataset was relatively moderate. Nevertheless, adding more data is recommended in further studies to obtain higher accuracy, precision, recall, and F1 score. The framework transforms qualitative online review data into quantitative data. Using this framework, the designer does not need to do a manual analysis to identify user preferences for each feature in MOOC. All the decisions were generated automatically based on online review data as input.
The quantitative result helps the designer find the words that often appear in the review data, together with their frequencies. It gives suggestions to the designer to add new features or improve the existing features in MOOC, especially content-related features. Quantitative results make the data easier to validate and analyse, and they also work well for large-scale data. However, there are several weaknesses in using quantitative results. Unlike qualitative data, the quantitative data did not carry specific information because the sentences from the online review data were separated into individual words. Since the words had been separated, the result never revealed why the user gave a positive or negative review about particular features. FSPs only paired the adjectives and nouns contained in a sentence; they lacked details and did not reveal the causation either. The designer used FSPs to identify popular comments about content-related features in the data. Since an FSP consisted of adjectives and nouns, the designer was able to conclude which features received a positive response and which did not.
There were differences between the results, as shown in Fig. 6-9. When the two models were compared, the Human Model was always more accurate than the Machine Model in grouping the FSPs. The causes of computational processing errors include:

1) Failure to produce "stemmed words"
2) Incorrectly assigning the "part of speech" tag to a word
3) Failure to generate an FSP from a sentence
The first cause occurred when the stemming algorithm failed to generate stem words. Commonly used stemming algorithms include Porter's Stemmer, Lovins Stemmer, Dawson Stemmer, Krovetz Stemmer, and Paice/Husk Stemmer [26] [27]. In this study, we used Porter's Stemmer because it produced the best output compared with the other stemming algorithms and its error rate was quite low [26] [27]. The weaknesses of Porter's Stemmer that still could not be avoided were over-stemming, where erasing the suffix changed the meaning of the word, and wrong-stemming, where part of the root was also deleted.
The second cause was assigning an incorrect word type tag. In this study, we tagged the words only as Noun and Adjective because those were the main elements of an FSP. Since an FSP consisted of "features" and "sentiments", features were derived from words tagged as nouns, while sentiments were derived from words tagged as adjectives. As the word type tagging was carried out after the data pre-processing, the output of the pre-processing greatly affected the tagging result.
Finally, an FSP was often missing from a sentence. This means that the sentence did not contain a Noun and an Adjective to pair. Such a sentence was not included in the dataset, which could reduce the number of valid data.
Because of these flaws, the valid data could be reduced, which affected the number of FSPs in the result. [28] offered a solution to improve the Machine Model's accuracy in pairing sentiments with features by measuring the distance between the Noun and the Adjective. This strategy would be very suitable for further studies. Another limitation of this study was the size of the dataset, which was under 10000 reviews. The dataset shrank further after some processes were applied. Due to the moderate size of the dataset, there were many missing data. A larger dataset will increase the accuracy and provide more varied information for the designer.
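The idea of using the distance between the Noun and the Adjective can be illustrated with a simple nearest-noun rule. This is only one possible reading of the distance idea and is not the algorithm of [28]: distance is assumed to be the difference in token positions after pre-processing, ties go to the earlier noun, and the function name and sample tokens are hypothetical.

```python
def pair_by_distance(typed_tokens):
    """Pair each adjective with the noun at the smallest token distance."""
    nouns = [(i, w) for i, (w, t) in enumerate(typed_tokens) if t == "noun"]
    pairs = []
    for i, (word, word_type) in enumerate(typed_tokens):
        if word_type == "adjective" and nouns:
            # min() keeps the first noun when two nouns are equally close.
            _, nearest = min(nouns, key=lambda noun: abs(noun[0] - i))
            pairs.append(f"{nearest}-{word}")
    return pairs


# Hypothetical tokens left from "The course was good but the videos were boring".
tokens = [("course", "noun"), ("good", "adjective"),
          ("videos", "noun"), ("boring", "adjective")]
print(pair_by_distance(tokens))
```

Pairing every noun with every adjective would yield four FSPs for this review, including pairs such as "course-boring" that the reviewer did not express, whereas the distance rule yields two.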

Conclusion
The proposed framework produced useful information for supporting data-driven design in MOOC. A MOOC could use FSPs to evaluate its content, which supports data-driven decision making on content-related features. The sentiment classification was done using a Support Vector Machine, and the accuracy, precision, recall, and F1 score were above 80%. The positive and negative FSPs successfully described the user preferences in MOOC, especially in Coursera. The output was a list of specific features and the frequency of each FSP, so that the FSPs could be ranked by how often they appear. The designer could evaluate MOOC content-related features to improve the product's content features. Further study is needed to improve the accuracy of the Machine Model's result by adding more data and reducing computational processing errors.

Authors
Nasa Zata Dina is a lecturer in the Department of Engineering, Faculty of Vocational Studies at Universitas Airlangga, Surabaya, Indonesia. Her research interests include data mining, text mining, and engineering education. She is also an editorial board member of Journal of Information Systems Engineering and Business Intelligence.
Riky Tri Yunardi is a lecturer in the Department of Engineering, Faculty of Vocational Studies at Universitas Airlangga, Surabaya, Indonesia. His research interests include medical electronics, robotics, image processing, artificial intelligence, and engineering education.
Aji Akbar Firdaus is a lecturer in the Department of Engineering, Faculty of Vocational Studies at Universitas Airlangga, Surabaya, Indonesia. His research interests include power systems simulation, power systems analysis, power systems stability, renewable energy, artificial intelligence, and engineering education.