Application for Identifying Students Achievement Prediction Model in Tertiary Education: Learning Strategies for Lifelong Learning

The purpose of this research is to identify tertiary students at risk of dropping out by means of an application. The research goals are (1) to develop a student achievement prediction model and (2) to construct a prototype application for predicting tertiary student dropout. The research tools consisted of three parts: (1) the tool for developing the predictive prototype was the CRISP-DM process with Decision Tree classification, Feature Selection methods, confusion matrix performance, cross-validation methods, and accuracy, precision, and recall measurements; (2) the tool for application development was the SDLC with the V-Model; and (3) the tool for assessing application satisfaction was a questionnaire with statistical analysis. The data sample was collected from 401 students enrolled in the Business Computer program at the School of Information and Communication Technology, University of Phayao, during the academic years 2012-2016. The results showed that the prediction model had a high accuracy (82.29%). The prototype, tested with the gathered data, also scored highly (84.04%; 337 of 401 training examples predicted correctly). The researchers planned to put the application to the test in the first semester of the academic year 2021 at the School of Information and Communication Technology, University of Phayao. For future research, the researchers plan to create a mobile application, on both Android and iOS, for mentors at the University of Phayao to monitor learners. Keywords: learning analytics, dropping out, educational data mining, eruptive technology, disruptive technology


Data collection
The data were collected from 401 students at the School of Information and Communication Technology, University of Phayao, and cover the students' academic performance in the Business Computer program during the academic years 2012-2016. The variables consisted of course details, academic results in each course, and academic achievement status in the program. The collected data are summarized in Table 1 and Table 2.

Modeling
CRISP-DM (Cross-Industry Standard Process for Data Mining) is a popular process model in the field of data mining. In this research, it was used to develop the students' achievement prediction model. It has six steps: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. An overview of the CRISP-DM process is shown in Figure 2.
Business understanding. The first step is Business Understanding. The objective is to present the context of the problem with relevant goals and information so that researchers can connect the data to a business model or research question. It is composed of four sub-stages: determine business objectives, assess situation, determine data mining goals, and produce project plan [11]. The product of this step is that the research team understands the context of the research project, which aims to create a model that can predict the academic achievement of students at the tertiary level.
Data understanding. The second step is Data Understanding. The objective is to know exactly what can be expected and what can be achieved from the data. It examines the quality of the data in several aspects such as data integrity, data fragmentation, data density, or dataset compliance values for governance and monitoring. It is composed of four sub-stages: collect initial data, describe data, explore data, and verify data quality [11].
For this step, the researchers summarized the problems that arise in the Business Computer program at the School of Information and Communication Technology, University of Phayao.
The researchers found that the number of students enrolled in the Business Computer program tended to decline significantly, in line with the declining population in Thailand. On the other hand, the number of dropout students has increased, as shown in Figure 3. Figure 3 shows the number of students who enrolled in the Business Computer program, the number who graduated, and the number who dropped out during the academic years 2001-2020. It reflects the problems of lower enrollment and high dropout. The researchers compiled and summarized the data, as presented in Table 1 and Table 2.
Data preparation. The third step is Data Preparation, which involves an ETL (Extract-Transform-Load) or ELT (Extract-Load-Transform) process. It selects and formats the data into a form that can be used by the algorithms and data mining processes for modeling. It is composed of five sub-stages: select data, clean data, construct data, integrate data, and format data [11].
For this research, data preparation started with research ethics. The research project was authorized under Project Code 2/020/63 on April 22, 2020 by the University of Phayao. After that, the researchers received data from the Division of Educational Services, University of Phayao. The data received comprise 254,456 transaction records of students in all educational programs of the School of Information and Communication Technology from the academic year 2001 to the academic year 2020.
In the next step, the researchers grouped the students affiliated with the Business Computer program. There were five groups. The first group was 397 students enrolled during the academic years 2001-2003. The second group was 588 students enrolled during the academic years 2004-2007. The third group was 532 students enrolled during the academic years 2008-2011. The fourth group was 401 students enrolled during the academic years 2012-2016. The fifth group was 124 students enrolled during the academic years 2017-2020. The grouping is based on the program version, which is updated every 3-5 years. The researchers used purposive sampling and picked the fourth group as the sample, as it was the latest program version for which the entire study process had been completed. The fifth group was not selected because most of its learners are still studying in the program.
Table 1 and Table 2 clearly show that most dropouts occur in Year 1. Therefore, the researchers extracted the learners' achievement results only from the 1st and 2nd semesters of Year 1, which consisted of 13 courses, for the analysis of educational achievement prediction. The data are available at https://bit.ly/3xVYe8s.
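The selection described above can be sketched as a filter over the transaction records followed by a pivot to one row per student. This is a minimal sketch, not the researchers' actual procedure: the `Record` layout, its field names, and the course codes in the example are hypothetical, since the source does not describe the schema of the received data.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Record(NamedTuple):
    """Hypothetical layout of one transaction record (assumed, not from the source)."""
    student_id: str
    program: str
    entry_year: int   # academic year in which the student enrolled
    study_year: int   # 1 = first year of study
    course: str
    grade: float


def build_feature_rows(
    records: Iterable[Record],
    program: str = "Business Computer",
    first_year: int = 2012,
    last_year: int = 2016,
) -> Dict[str, Dict[str, float]]:
    """Keep Year 1 results of the chosen cohort, one dict of course grades per student."""
    rows: Dict[str, Dict[str, float]] = defaultdict(dict)
    for r in records:
        in_cohort = r.program == program and first_year <= r.entry_year <= last_year
        if in_cohort and r.study_year == 1:
            rows[r.student_id][r.course] = r.grade
    return dict(rows)


# Made-up records: only s1's Year 1 result survives the filter.
example = [
    Record("s1", "Business Computer", 2013, 1, "COURSE-A", 3.0),
    Record("s1", "Business Computer", 2013, 2, "COURSE-B", 2.0),
    Record("s2", "Other Program", 2013, 1, "COURSE-A", 4.0),
    Record("s3", "Business Computer", 2018, 1, "COURSE-A", 1.0),
]
features = build_feature_rows(example)
print(features)
```

Each resulting row would then hold up to 13 course grades, one per Year 1 course, as the attributes for modeling.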
Modeling. The fourth step is Modeling, which is the core of any data mining project. This step builds the models that should meet, or help achieve, the project goals. The selection of the modeling method is important, as the right tools must be chosen for the purpose and goals. This step is made up of four sub-stages: select modeling techniques, generate test design, build model, and assess model [11].
The modeling consisted of two parts covering three techniques. The first part is the Decision Tree modeling technique, which is easy to understand and can be utilized in a variety of ways [12]. The second part improves the efficiency of the model with Feature Selection (FS) techniques [13], [14], which discover the attributes that influence the model and can reduce the number of variables for better model performance [13]. Two FS techniques were used: Forward Selection and Backward Elimination. Forward Selection wraps the modeling in a subprocess: it starts with an empty attribute set, and each new round adds one unused attribute of the given data set and measures model performance. In contrast, Backward Elimination starts with the entire set of attributes, and each round removes one of the remaining attributes and measures model performance again.
All three techniques were tested in detail to determine model performance using confusion matrix performance, cross-validation methods, and accuracy, precision, and recall measurements, as presented in the Evaluation step.
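The two feature selection loops can be sketched as greedy searches over attribute subsets. This is an illustrative sketch under stated assumptions, not the tool configuration used in the research: the `score` callback stands in for "train a decision tree on these attributes and return its cross-validated accuracy", and the attribute names and `toy_score` function are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, Tuple

# score(subset) is assumed to return model performance for that attribute subset.
Score = Callable[[FrozenSet[str]], float]


def forward_selection(attributes: Iterable[str], score: Score) -> Tuple[FrozenSet[str], float]:
    """Start with no attributes; each round add the attribute that helps most."""
    remaining = set(attributes)
    selected: FrozenSet[str] = frozenset()
    best = float("-inf")
    while remaining:
        cand_score, cand = max((score(selected | {a}), a) for a in sorted(remaining))
        if cand_score <= best:
            break  # no remaining attribute improves performance
        selected, best = selected | {cand}, cand_score
        remaining.remove(cand)
    return selected, best


def backward_elimination(attributes: Iterable[str], score: Score) -> Tuple[FrozenSet[str], float]:
    """Start with all attributes; each round drop the one whose removal costs least."""
    selected = frozenset(attributes)
    best = score(selected)
    while len(selected) > 1:
        cand_score, cand = max((score(selected - {a}), a) for a in sorted(selected))
        if cand_score < best:
            break  # every possible removal lowers performance
        selected, best = selected - {cand}, cand_score
    return selected, best


def toy_score(subset: FrozenSet[str]) -> float:
    """Hypothetical stand-in for model accuracy: 'b' and 'd' help, extra attributes cost a little."""
    useful = {"b", "d"}
    return len(subset & useful) - 0.01 * len(subset)


forward_set, _ = forward_selection(["a", "b", "c", "d"], toy_score)
backward_set, _ = backward_elimination(["a", "b", "c", "d"], toy_score)
print(sorted(forward_set), sorted(backward_set))
```

On this toy score both searches arrive at the same subset. With real data they can differ, as in this research, where Forward Selection kept 4 attributes and Backward Elimination kept 8.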
Evaluation. The fifth step is Evaluation. Its main function is to determine whether the results are accurate and whether the models are sound. If the results are incorrect, this step suggests reviewing the process and going back to the first step to understand the problem and figure out why. It consists of three sub-stages: evaluate results, review process, and determine next steps [11].
For evaluation at this stage, the researchers used confusion matrix performance, cross-validation methods, and accuracy, precision, and recall measurements.
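The accuracy, precision, and recall measurements follow from the four counts of a two-class confusion matrix. The sketch below uses the standard definitions; the label names and the five example predictions are made up for illustration and are not data from this research.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(actual: Sequence[str], predicted: Sequence[str], positive: str) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) for a two-class problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, and recall from confusion matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Illustrative labels only.
actual = ["graduate", "graduate", "dropout", "dropout", "graduate"]
predicted = ["graduate", "dropout", "dropout", "graduate", "graduate"]
counts = confusion_counts(actual, predicted, positive="graduate")
result = metrics(*counts)
print(counts, result)
```

Precision answers "of the students predicted to graduate, how many did", while recall answers "of the students who graduated, how many were predicted to".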
The confusion matrix is one of the simplest tools for measuring model performance. It yields three indicators: accuracy, precision, and recall. It is used for analyzing classification problems whose results fall into two categories (classes). The calculation formulas and the relations within the confusion matrix are shown in Figure 4. To find more efficient models, cross-validation methods were used in this research. The principle of cross-validation is to divide the dataset into two parts: the first part is used for training (modeling), and the rest is used to test the models obtained from the first part. The concept of dividing data for model performance testing is presented in Figure 5.
Deployment. The sixth and last step is Deployment. It consists of presenting the results in a useful and understandable way; when this goal is achieved, the project has met its objectives. It is composed of four sub-stages: plan deployment, plan monitoring and maintenance, produce final report, and review project [11].
This research recognizes the importance of model deployment. Therefore, the researchers developed the most efficient model into an application prototype, as detailed in the Application construction section.

Application construction
To develop a good application, developers should adopt an effective Software Development Life Cycle (SDLC). The researchers found that one of the more popular SDLCs is the V-Model [15], [16]. The V-Model is a type of SDLC with a V-shaped process in which all activities are carried out and assessed sequentially. It is also known as "the verification and validation model" [15]. It consists of three phases and nine steps: the design phase, the coding phase, and the testing phase, as detailed below and presented in Figure 6.
Design phases. The design phase is the preparation phase for application development. It consists of four sub-phases: requirement analysis, system design, architectural design, and module design.
For this research, the design phase applies the model produced in the previous process to the design of an appropriate application. The application development follows these steps:
Requirement Analysis: This step reviews previous studies in detail to understand the requirements and expectations of the research problem. It can also be called "Requirement Gathering".
System Design: This step covers the system design and hardware setup. It considers a complete communication model for product development, in which the researchers use responsive technology for the convenience of the user.
Architectural Design: The system design is broken into modules with different functions, and the transfer of data and communication between the internal modules and the outside world is clearly defined. In designing the program, the researchers ask users to consent to providing data and to the use of the application; the data collected in the application are handled in accordance with research ethics.
Module Design: At this step, the system is divided into smaller modules with a detailed design for each, known as Low-Level Design (LLD). For this research, the researchers designed an intuitive user interface with the simplest processes for the best benefit of the user.
Coding phase. The coding phase is a critical step, in which the researchers selected the tools appropriate to the technology and to the prototypes from the previous analyses. The database used in this research was MariaDB. The programming language for application development was PHP, with Laravel as the main framework.
Testing phases. The testing phase is aimed at tracking each activity that takes place in the application development. It consists of four steps: unit testing, integration testing, system testing, and user acceptance testing.
Unit testing: Unit test plans are developed during the module design phase. Unit tests are conducted to eliminate bugs at the code or unit level.
Integration testing: After the unit tests are complete, integration testing is performed. The modules are put together and the combined system is tested. Integration test plans are developed during the architectural design phase. This test verifies that the modules communicate correctly with each other.
System testing: The complete application is tested for functionality, interdependence, and communication. It is a test of the functional and non-functional requirements of the developed application.
User acceptance test (UAT): UAT is performed in a user environment similar to the production environment. UAT verifies that the delivered system meets user requirements and that the system is ready for real-world use.

Research results
This section reports the research results in three parts: data collected and data sampling, academic achievement prediction model, and prototype of the prediction application model.

Data collected and data sampling
The data collected comprise 254,456 transaction records of students in all educational programs of the School of Information and Communication Technology from the academic year 2001 to the academic year 2020. In this research, the sample was 401 students enrolled in the Business Computer program at the School of Information and Communication Technology, University of Phayao, during the academic years 2012-2016, as discussed in data preparation.
Table 1 shows the student data that were collected, classified by year for the academic years 2012-2016. The data clearly reflect the problem: 167 students (41.65%) dropped out. In addition, 65 students (16.21%) failed to complete the program within the required study period. Moreover, most of the dropouts occurred in Year 1, as shown in Table 2.
Table 2 shows the dropout data of students in the Business Computer program during the academic years 2012-2016. Of the 167 students who dropped out, 98 (58.68%) did so during the 1st year and 50 (29.94%) during the 2nd year. From these findings, the researchers decided to select the Year 1 courses as attributes for the analysis model. There are 13 courses, as shown in Table 3.
Table 3 shows the attributes (variables) used in the analysis to develop the achievement prediction model. It consists of thirteen courses of two types: general education courses and major courses. These attributes were analyzed to produce the models presented in the next section.
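The percentages reported for Table 1 and Table 2 use two different denominators, which is easy to overlook: the shares of the sample are taken over 401 students, while the per-year dropout shares are taken over the 167 students who dropped out. A short check of the arithmetic, using only the counts stated above:

```python
def percent(part: int, whole: int) -> float:
    """Percentage rounded to two decimals, as reported in the text."""
    return round(100 * part / whole, 2)


SAMPLE = 401        # students in the 2012-2016 cohort
DROPPED_OUT = 167   # students who dropped out
OVERDUE = 65        # students who did not finish within the required period

share_dropped = percent(DROPPED_OUT, SAMPLE)   # share of the whole sample
share_overdue = percent(OVERDUE, SAMPLE)       # share of the whole sample
share_year1 = percent(98, DROPPED_OUT)         # share of all dropouts
share_year2 = percent(50, DROPPED_OUT)         # share of all dropouts
print(share_dropped, share_overdue, share_year1, share_year2)
```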

Academic achievement prediction model
The models obtained from this research come from three techniques: the Decision Tree classification technique, the Forward Selection technique, and the Backward Elimination technique.
Results of the decision tree classification model. The analysis results of the Decision Tree classification model are presented in two parts. The first is the model analysis report, presented in Table 4. The second is the prototype model with the highest accuracy, shown in Table 5.
Table 4 shows the analysis of Decision Tree modeling. The most effective Decision Tree model was the one with depth 2, tested with the leave-one-out cross-validation method, which reached the highest accuracy of 78.30%. The test results are shown in Table 5.
Table 5 presents the details of the most effective Decision Tree model from Table 4. It has an accuracy of 78.30%, a precision of 77.32%, a recall of 88.89%, and an AUC of 49.39%. It can be concluded that this model is suitable for this approach.
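Leave-one-out cross-validation holds out each student once, trains on all the others, and checks the prediction for the held-out student. The sketch below illustrates the procedure only: `fit_stump` is a deliberately simple one-threshold learner standing in for the decision tree, and the six grade values and labels are made up, not taken from the research data.

```python
from typing import Callable, List, Sequence, Tuple

Row = Tuple[Sequence[float], str]          # (attribute values, class label)
Model = Callable[[Sequence[float]], str]   # maps attribute values to a label


def leave_one_out_accuracy(rows: List[Row], fit: Callable[[List[Row]], Model]) -> float:
    """Train on all rows but one, test on the held-out row, repeat for every row."""
    correct = 0
    for i, (features, label) in enumerate(rows):
        training = rows[:i] + rows[i + 1:]
        model = fit(training)
        if model(features) == label:
            correct += 1
    return correct / len(rows)


def fit_stump(training: List[Row]) -> Model:
    """Hypothetical stand-in learner: split the first attribute midway between class means."""
    grads = [f[0] for f, y in training if y == "graduate"]
    drops = [f[0] for f, y in training if y == "dropout"]
    if not grads or not drops:
        only = "graduate" if grads else "dropout"
        return lambda features: only
    threshold = (sum(grads) / len(grads) + sum(drops) / len(drops)) / 2
    return lambda features: "graduate" if features[0] >= threshold else "dropout"


# Made-up grade points for one course.
toy_rows: List[Row] = [
    ((3.5,), "graduate"), ((3.0,), "graduate"), ((2.5,), "graduate"),
    ((1.0,), "dropout"), ((1.5,), "dropout"), ((2.8,), "dropout"),
]
loo_accuracy = leave_one_out_accuracy(toy_rows, fit_stump)
print(round(loo_accuracy, 4))
```

With 401 students, this procedure trains 401 models, each tested on a single held-out student, which is affordable for shallow decision trees.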
Results of the forward selection model. The analysis results of the Forward Selection model are presented in two parts, like those of the Decision Tree model, so that the results can be compared. The first is the model analysis report, presented in Table 6. The second is the prototype model with the highest accuracy, shown in Table 7.
Notes for Table 6: Num_Att = number of attributes, Depth_DT = depth of decision tree model. Table 6 shows the analysis of Forward Selection modeling. The most effective Forward Selection model used 4 attributes at depth 4 and was tested with the leave-one-out cross-validation method, reaching the highest accuracy of 82.29%. The test results are shown in Table 7. Please note that Table 6 presents only the result with the highest accuracy for each number of added attributes; the complete analysis is available at https://bit.ly/3xVYe8s.
The attributes that are significant to the model consist of four courses: 001103 Thai Language Skills, 221100 Business Mathematics, 221110 Fundamental Information Technology, and 221120 Introduction to Programming.
Table 7 presents the details of the most effective Forward Selection model from Table 6. It has an accuracy of 82.29%, a precision of 85.59%, a recall of 83.76%, and an AUC of 45.14%. It can be concluded that this model is suitable for this approach.
Results of the backward elimination model. The analysis results of the Backward Elimination model are presented in two parts, like those of the Decision Tree model, so that the results can be compared. The first is the model analysis report, presented in Table 8. The second is the prototype model with the highest accuracy, shown in Table 9.
Table 8 shows the analysis of Backward Elimination modeling. The most effective Backward Elimination model used 8 attributes at depth 4 and was tested with the leave-one-out cross-validation method, reaching the highest accuracy of 82.29%. The test results are shown in Table 9. Please note that Table 8 presents only the result with the highest accuracy for each number of removed attributes; the complete analysis is available at https://bit.ly/3xVYe8s.
The attributes that are significant to the model consist of eight courses: 001103 Thai Language Skills, 001112 Developmental English, 003134 Civilization and Indigenous Wisdom, 126100 Introduction to Economics, 128221 Principles of Marketing, 221100 Business Mathematics, 221110 Fundamental Information Technology, and 221120 Introduction to Programming.
Table 9 presents the details of the most effective Backward Elimination model from Table 8. It has an accuracy of 82.29%, a precision of 87.21%, a recall of 81.62%, and an AUC of 42.27%. It can be concluded that this model is suitable for this approach.
From the analysis of the three models, two share the highest accuracy: Forward Selection and Backward Elimination, both at 82.29%. Only one model can be selected, and the researchers chose the Forward Selection model because its recall (83.76% versus 81.62%) and AUC (50.00% versus 42.27%) are higher. The rule model is shown in Table 10.
Table 10 shows the rule model. When tested with the collected data, the model made 337 accurate predictions out of 401 (84.04%). This rule model was developed into an application, which is presented in the next section.
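The choice among the three models can be expressed as a ranking by accuracy with recall as the tie-breaker. This is one possible encoding of the reasoning above, not a rule stated by the researchers; the figures are those reported for Tables 5, 7, and 9, and AUC is left out of the sketch.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str
    accuracy: float
    precision: float
    recall: float


RESULTS: List[Result] = [
    Result("Decision Tree", 78.30, 77.32, 88.89),
    Result("Forward Selection", 82.29, 85.59, 83.76),
    Result("Backward Elimination", 82.29, 87.21, 81.62),
]


def select_model(results: List[Result]) -> Result:
    """Highest accuracy wins; equal accuracy is broken by higher recall (assumed rule)."""
    return max(results, key=lambda r: (r.accuracy, r.recall))


best = select_model(RESULTS)
rule_model_accuracy = round(100 * 337 / 401, 2)  # 337 correct out of 401
print(best.name, rule_model_accuracy)
```

Note that the Decision Tree model has the highest recall of the three, so ranking by recall alone would pick a different model; accuracy is applied first.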

Prototype of the prediction application model
This section presents the user interface of the application, available at https://bit.ly/3yiP8Tk. It consists of five main pages, as presented in Figures 7 to 11.
Fig. 7. The user interface of the application
Figure 7 shows the first page of the application. It is an introduction page with the application name and a user agreement. When users accept the agreement, they proceed to the next page.
Figure 8 shows the general user information page. It collects first name, last name, email address, and phone number. The purpose of this page is to allow future communication with users.
Figure 9 shows the page for entering academic results, which the application uses to predict the student's academic achievement. The page takes the results in four courses: 001103 Thai Language Skills, 221100 Business Mathematics, 221110 Fundamental Information Technology, and 221120 Introduction to Programming.
Figure 10 shows the completed academic results entered by the user, from which the prediction displayed in Figure 11 is calculated.
Figure 11 shows the analysis results according to the rule model in Table 10. The result page shows the percentage chance that a student will graduate or drop out.

Research discussions
The discussion addresses two major areas, aligned with the research objectives: the student achievement prediction model and the application for tertiary student dropout prediction.

Model discussions
For model analysis, the researchers presented three models: one from the Decision Tree classification technique, one from the Forward Selection technique, and one from the Backward Elimination technique.
For the Decision Tree classification technique, the most accurate model had depth 2 and was tested with the leave-one-out cross-validation method, with an accuracy of 78.30%, a precision of 77.32%, a recall of 88.89%, and an AUC of 49.30%. For the Forward Selection technique, the most accurate model had 4 attributes and depth 4 and was tested with the leave-one-out cross-validation method, with an accuracy of 82.29%, a precision of 85.59%, a recall of 83.76%, and an AUC of 45.14%. Lastly, for the Backward Elimination technique, the most accurate model had 8 attributes and depth 4 and was tested with the leave-one-out cross-validation method, with an accuracy of 82.29%, a precision of 87.21%, a recall of 81.62%, and an AUC of 42.27%.
By analyzing the three models, the researchers concluded that the reasonable and appropriate model to implement is the Forward Selection model. The rationale rests on four indicators: accuracy, precision, recall, and AUC. With accuracy tied, the Forward Selection model has both a higher recall (83.76% versus 81.62%) and a higher AUC (50.00% versus 42.27%) than the Backward Elimination model.

Application discussions
To discuss the application, the researchers used a questionnaire to assess respondents' satisfaction with it. The respondents were 26 students from the Business Computer program in the 2nd semester of the academic year 2020 at the School of Information and Communication Technology, University of Phayao.
Software testing is the process of evaluating and improving the quality of software by identifying errors or bugs. It can identify useful approaches for improving and modifying the software. The method used is Black-Box testing, a software testing method that validates an application's functionality without requiring knowledge of its internal structure or implementation. This testing method can be used at all levels of software testing. The assessment covered four topics. The 1st topic is the usability test (U-Test), an evaluation of the application's ability to interact with its user. The 2nd topic is the function test (F-Test), an assessment of the correctness of the application's operation. The 3rd topic is the functional requirement test (FR-Test), an assessment of the application's capabilities against the needs of the users. The last topic is the security test (SC-Test), which evaluates the application in terms of security and data protection.
To measure satisfaction with the application, the questionnaire used a 5-level Likert scale: level 5 means strongly agree, level 4 means agree, level 3 means neither agree nor disagree, level 2 means disagree, and level 1 means strongly disagree. The acceptance criteria interpret the mean score in five bands: 4.21-5.00 is the highest accepted level, 3.41-4.20 is the high accepted level, 2.61-3.40 is the accepted level, 1.81-2.60 is the low accepted level, and 1.00-1.80 is the lowest accepted level. The analysis of attitudes and satisfaction with the application is presented in Table 11.
Table 11 shows the attitude and satisfaction toward the application for identifying students at risk of dropping out in tertiary education. Overall, respondents were highly satisfied with the application (mean = 3.93, S.D. = 0.89). The highest-rated aspect is the function test (F-Test) with a mean of 4.07 (S.D. = 0.85). The second is the functional requirement test (FR-Test) with a mean of 3.88 (S.D. = 0.90). The third is the usability test (U-Test) with a mean of 3.86 (S.D. = 0.93). The last is the security test (SC-Test) with a mean of 3.86 (S.D. = 0.90).
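The acceptance criteria divide the 1-5 range into five bands of width 0.80. A minimal sketch of the interpretation step follows; the function name is hypothetical, and rounding the mean to two decimals before the lookup is an assumption made so that values cannot fall between the published band edges.

```python
from typing import List, Tuple

# Lower bound of each band, from highest to lowest, as listed in the criteria.
BANDS: List[Tuple[float, str]] = [
    (4.21, "highest accepted level"),
    (3.41, "high accepted level"),
    (2.61, "accepted level"),
    (1.81, "low accepted level"),
    (1.00, "lowest accepted level"),
]


def acceptance_level(mean: float) -> str:
    """Map a mean Likert score (1.00-5.00) to its acceptance band."""
    score = round(mean, 2)  # assumption: means are reported to two decimals
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"mean score out of range: {mean}")
    for lower, label in BANDS:
        if score >= lower:
            return label
    raise AssertionError("unreachable: every valid score matches a band")


overall = acceptance_level(3.93)    # overall mean reported for Table 11
function_test = acceptance_level(4.07)
print(overall, "|", function_test)
```

All four topic means and the overall mean fall in the 3.41-4.20 band, which is why every aspect is read as highly accepted.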
In summary, the developed application has a high level of acceptance, and the researchers will implement it in the future.

Conclusion
A problem affecting the Thai education process is the inability to predict learners' academic achievement. Thus, the objective of this research is to identify tertiary students at risk of dropping out by means of an application. The research goals are (1) to develop a student achievement prediction model and (2) to construct a prototype application for predicting tertiary student dropout.
The research tools consisted of three parts: (1) the tool for developing the predictive prototype was the CRISP-DM process with Decision Tree classification, Feature Selection methods, confusion matrix performance, cross-validation methods, and accuracy, precision, and recall measurements; (2) the tool for application development was the SDLC with the V-Model; and (3) the tool for assessing application satisfaction was a questionnaire with statistical analysis. The data sample was collected from 401 students enrolled in the Business Computer program at the School of Information and Communication Technology, University of Phayao, during the academic years 2012-2016. The results showed that the prediction model had a high accuracy (82.29%). The prototype, tested with the gathered data, also scored highly (84.04%; 337 of 401 training examples predicted correctly). The researchers planned to put the application to the test in the first semester of the academic year 2021 at the School of Information and Communication Technology, University of Phayao.

Future works
For future research, the researchers plan to test the application with students in the Business Computer program at the School of Information and Communication Technology, University of Phayao, in the next academic year. In addition, the researchers plan to create a mobile application, on both Android and iOS, for mentors at the University of Phayao to monitor learners.