Comparative Analysis of Supervised Machine Learning Algorithms to Build a Predictive Model for Evaluating Students’ Performance

In recent years, demand for reliable predictions has grown in every field, and obtaining accurate forecasts is becoming a necessity. Although the future cannot be known with certainty, likely outcomes can be estimated from existing data. In this sense, machine learning is a way to analyze huge datasets to make strong predictions or decisions. The main objective of this research work is to build a predictive model for evaluating students’ performance. The contributions are threefold. The first is to apply several supervised machine learning algorithms (i.e., ANCOVA, Logistic Regression, Support Vector Regression, Log-linear Regression, Decision Tree Regression, Random Forest Regression, and Partial Least Squares Regression) to our education dataset. The second is to compare and evaluate the algorithms used to create a predictive model, based on various evaluation metrics. The last is to determine the most important factors that influence the success or failure of the students. The experimental results show that Log-linear Regression provides the best predictions and that behavioral factors influence students’ performance.


Introduction
With the remarkable growth in the size of data warehouses, analyzing data and extracting useful information is becoming a necessity and a rich topic for many researchers [1]. Many application areas adopt machine learning techniques in their systems, such as finance, shopping platforms, restaurants, economy, medicine, tourism, and marketing. Over the last two decades, machine learning has entered the e-learning space as well [2] [3] [4] [5]. Thus, several machine learning algorithms have been exploited by researchers to uncover hidden patterns in educational settings [6] [7] [8].
Predicting which students are at risk of academic failure is of the utmost importance, and these students must be identified as early as possible during the academic year. Early prediction of student performance is necessary in higher education for providing high-quality education, reducing dropout rates, increasing school completion rates, and improving educational outcomes.
However, the real and major problems are:
• How can we identify the "weak" students who will need additional help to improve their performance?
• Which is the best machine learning algorithm (i.e., model) for predicting students' academic performance?
• What factors can affect students' academic performance?
This research work evaluates and compares the effectiveness of different machine learning algorithms. While there are many algorithms for creating predictive models, this work concentrates on seven of them, which are ANCOVA, Logistic Regression, Support Vector Regression, Log-linear Regression, Decision Tree Regression, Random Forest Regression, and Partial Least Squares Regression. The present paper also determines the factors affecting students' academic performance.
The outline of the present paper is as follows: Section 2 presents recent studies regarding the specified area. The background of machine learning is briefly described in Section 3. Section 4 concentrates on the proposed approach. A description of the materials, as well as the methods, is presented in Section 5. In Section 6 our implementation and results are presented. Section 7 concentrates on experimental evaluation. Section 8 contains the discussion. Finally, Section 9 presents the main conclusions considering some future research directions.

Related Work
In recent decades, many studies by several research teams have focused on predicting the performance of students based on diverse factors using various machine learning algorithms.
Bravo-Agapito et al. [13] studied the prediction of the academic performance of 802 undergraduate students in completely online learning. They used exploratory factor analysis, multiple linear regression, and cluster analysis. They concluded that "age" is a factor that affects the academic achievement of the student. Gray and Perkins [14] conducted a study on predicting student outcomes as early as week 4 of the Fall semester using machine learning techniques. Hamsa et al. [15] applied two classification methods, decision tree and fuzzy genetic algorithm, to predict the performance of Bachelor and Master degree students in Computer Science and Electronics and Communication. Hussain et al. [16] described a performance study on predicting student difficulties from learning session data. They used artificial neural networks, support vector machines, logistic regression, Naïve Bayes classifiers, and decision trees. Their results show that artificial neural networks and support vector machines are the best algorithms to predict the performance of the student. Karthikeyan et al. [17] investigated the performance of students by developing a hybrid educational data mining model called HEDM. Their model combines two techniques, the J48 classifier and Naïve Bayes classification. Their results show that HEDM outperforms the results obtained with EDM.
In summary, many researchers have obtained significant results in educational data mining in their recent papers. However, most of them use classification methods for predicting students' academic performance. Moreover, there has been very little focus on interactional and parental involvement features.

Machine Learning
Machine learning reproduces behavior using learning algorithms that are themselves fueled by immense sources of information. The computer trains and improves, hence the word learning; it "learns" from data and extracts knowledge from it.
The algorithms are the engines of machine learning. In general, three main types of machine learning algorithms are used: supervised learning, unsupervised learning, and reinforcement learning.
• Supervised learning: The system learns a function from examples.
• Unsupervised learning: The system does not rely on predefined elements.
• Reinforcement learning: The algorithm learns from its own mistakes. Faced with a random choice at the start, it uses rewards and punishments as signals for good and bad decisions.
After briefly describing the background of machine learning, in the next section, we will present our proposed approach.

Proposed Approach
Increasingly, e-learning has become an important tool for teaching and learning around the world. Furthermore, learners have the opportunity to switch to distance learning in various scientific fields anytime and anywhere [9]. It is therefore evident that many researchers work on the various aspects of e-learning [10] [11] [12]. The identification of the "weak" students and of the factors affecting students' academic performance is a crucial step for successful learning. Hence, in the present paper, we aim to evaluate students' academic performance and to identify the factors that influence it using supervised machine learning algorithms.
This research work focuses on the following steps:
• Applying several machine learning algorithms, namely ANCOVA, Logistic Regression, Support Vector Regression, Log-linear Regression, Decision Tree Regression, Random Forest Regression, and Partial Least Squares Regression.
• Comparing and evaluating these algorithms to identify the most suitable ones, using several evaluation metrics: Mean Square Error (MSE), Root Mean Square Error (RMSE), and R-squared (R²).
• Identifying which factors influence the final prediction of students' results.
The next section describes the materials and methods used in our research work which are the dataset, the applied methods, and evaluation methods.

Dataset
The data used for this work's experimentation is taken from a dataset named "Students' Academic Performance Dataset (xAPI-Edu-Data)" [18] [19]. It is an open-source dataset, publicly available on the Kaggle dataset repository for academic and research purposes. The primary source of the dataset is Elaf Abu Amrieh, Thair Hamtini, and Ibrahim Aljarah, The University of Jordan, Amman, Jordan, http://www.Ibrahimaljarah.com, www.ju.edu.jo. The data was obtained from the Learning Management System known as Kalboard 360 [20]. Kalboard 360 was created to help schools improve their learning through the use of cutting-edge technology. Typically, such a system shares educational resources and gives users synchronous access to them from any device that has internet access.
Table 1 provides a summary of the dataset characteristics, including name, abbreviation, source, characteristics, number of samples, area, attribute characteristics, number of attributes, date, associated tasks, missing values, and file formats. As shown in Table 1, the dataset consists of 480 student records from various countries and 17 features. The features are classified into three main categories, named "Demographic features", "Academic background features", and "Behavioral features":
• Demographic features: include attributes such as gender, nationality, and place of birth.
• Academic background features: represent the background characteristics of students, such as educational stage, grade level, section, and semester.
• Behavioral features: describe behavior such as raising a hand in class, opening resources, parents answering surveys, and school satisfaction.
Table 2 contains an overview of the dataset features used for training and testing. It contains three fields: feature, description, and type. It should be noted that there are two major feature types, named "Nominal" and "Numeric".
• Nominal: labels variables with non-numeric values.
Having described the dataset used in our experimentation, we present in the next section the selected methods for predicting students' academic performance.
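To make the distinction between the two feature types concrete, the sketch below one-hot encodes a nominal column so that a regression method can use it alongside a numeric column. It is only an illustration: the field names and values are hypothetical stand-ins, not records from the dataset, and the study itself relied on XLSTAT rather than on this code.

```python
def one_hot(values):
    """Return (categories, rows); each row is a 0/1 indicator list."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical records: one nominal feature and one numeric feature.
records = [
    {"gender": "F", "raised_hands": 15},
    {"gender": "M", "raised_hands": 70},
    {"gender": "F", "raised_hands": 42},
]

categories, indicators = one_hot([r["gender"] for r in records])

# Numeric features pass through unchanged; nominal ones become indicators.
encoded = [ind + [r["raised_hands"]] for ind, r in zip(indicators, records)]
```

Each nominal value becomes a vector with a single 1, which avoids imposing an artificial order on categories such as nationality or gender.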

Selected methods
It is impossible to predict the future with certainty, but a highly probable outcome can be determined by looking at existing data sources. Nowadays, there are many machine learning algorithms for predictive modeling. In the present work, we concentrate on supervised machine learning algorithms because they are the most appropriate (see Section 3 for more details).
In the next sections, we will present the algorithms used to build predictive models which are ANCOVA, Support Vector Regression, Decision Tree Regression, Random Forest Regression, Partial Least Squares Regression, Log-linear Regression, and Logistic Regression.
ANCOVA (ANalysis of COVAriance) [21] is a statistical test that makes it possible to compare globally the mathematical expectation of several samples. The name of this test is explained by its way of proceeding: we decompose the total variance of the sample into two partial variances, the inter-class variance and the residual variance, and we compare these two variances. The ANCOVA model is written as follows (1):
$$y_{ij} = \mu + \tau_j + \beta (x_{ij} - \bar{x}) + \varepsilon_{ij} \quad (1)$$
Where:
• y_ij: is the ith observation in the jth group.
• μ: is a constant common to all individuals.
• τ_j: is the treatment effect of the jth group.
• β: is the regression slope corresponding to the covariate x_ij.
• x_ij: is the covariate for the ith subject in the jth group.
• x̄: is the overall mean of x.
• ε_ij: is a Gaussian error term.
As shown in Figure 1, ANCOVA helps to compare two or more regression lines with each other.
Logistic Regression (Logit-R) [22] is a statistical method for performing binary classifications such as healthy/sick, win/lose, pass/fail, or alive/dead.

Fig. 2. Logistic Regression
It takes qualitative and/or ordinal predictor variables as input and measures the probability of the output value using the sigmoid function shown in Figure 2 and defined by formula (2), where z is a linear combination of the predictor variables:
$$\sigma(z) = \frac{1}{1 + e^{-z}} \quad (2)$$
Support Vector Regression (SVR) [23] derives from the support vector machine, which, like Logistic Regression, is a binary classification algorithm. Consider the two classes in Figure 3 (e.g., suppose these are e-mails, with spam in red and non-spam in blue). Logistic Regression can separate these two classes with the red line, whereas the support vector approach opts to separate them with the green line (see Figure 3).

Fig. 3. Support Vector Regression
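Returning to the sigmoid function on which Logistic Regression relies, the following minimal sketch shows how a linear score is turned into a probability. The function names, weights, and bias are hypothetical and chosen only for illustration; they are not fitted values from this study.

```python
import math


def sigmoid(z):
    """Logistic function: maps a real score z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Algebraically equivalent form that avoids overflow for very negative z.
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, features):
    """Probability of the positive class (e.g., 'pass') for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients: the two terms cancel, so the score is zero.
p = predict_proba([1.0, -1.0], 0.0, [2.0, 2.0])
```

A score of zero corresponds to a probability of 0.5, which is the usual decision boundary between the two classes.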
Decision Tree Regression (DTR) [25] is an algorithm that uses a tree-shaped graph model to reach the final decision. Each node holds a condition, and the branches follow from whether that condition is true or false. The further down the tree we go, the more conditions accumulate. Figure 4 illustrates this operation.
Log-linear Regression (Log-LR) [24] is part of the family of generalized linear models for Exponential-distributed, Gamma, or Poisson data. This method is a linear approach to modeling the relationship between a response variable and one or more explanatory variables. We assume that the logarithm of the expected response is an affine function of the explanatory variables.
Random Forest Regression (RFR) [26] is a supervised learning algorithm that combines the predictions of multiple models to make a more accurate prediction than a single model (see Figure 5).
Partial Least Squares Regression (PLS-R) [27] is a flexible statistical technique applicable to any form of data. It allows modeling the relationships between inputs and outputs, even when the inputs are correlated and noisy, the outputs are multiple, and the inputs are more numerous than the observations.
In the next section, we will concentrate on the evaluation metrics used in our experimental study for identifying the best machine learning algorithm.
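Before turning to the metrics, the two tree-based ideas described above can be sketched in a few lines: a single true/false condition on one feature (a depth-1 regression tree, often called a stump) and an average over many such trees fitted on bootstrap samples. This is a deliberately simplified illustration under our own assumptions, not the XLSTAT implementation used in the experiments; the function names and the toy data are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on a single numeric feature.

    Tries every midpoint between consecutive distinct x values and keeps
    the split with the smallest total squared error.
    Returns (threshold, left_mean, right_mean).
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values equal: always predict the overall mean
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    """Follow the single condition: x <= threshold goes left, else right."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Fit n_trees stumps, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all stumps in the forest."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)


# Toy data (made up): two well-separated groups of observations.
xs = [1, 2, 3, 10, 11, 12]
ys = [10, 12, 11, 50, 52, 51]
stump = fit_stump(xs, ys)
forest = fit_forest(xs, ys)
```

A full random forest would grow deeper trees and also sample a subset of features at each split; the sketch keeps only the bootstrap-and-average step to show why combining several models smooths the prediction of any single one.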

Evaluation methods
Evaluating a model is a core part of building an effective machine learning model. There are many evaluation methods that can be used. However, the question is: which metrics should we use to evaluate regression techniques in machine learning? Figure 6 answers this question.
Fig. 6. Right metrics for evaluating machine learning models [28]
In the following, we discuss the three main metrics used in our evaluation.
R-Squared (R² or the coefficient of determination) [29] is an indicator for judging the quality of a simple linear regression. It measures the fit between the model and the observed data, that is, how well the regression equation describes the distribution of points.
• If R² is zero, the equation of the regression line explains 0% of the distribution of points. This means that the mathematical model used does not explain the distribution of points.
• If R² is 1, the equation of the regression line explains 100% of the distribution of points. This means that the mathematical model used, as well as the calculated parameters a and b, fully determine the distribution of the points.
In short, the closer the coefficient of determination is to 0, the more the scatter plot disperses around the regression line. On the contrary, the more the R² tends towards 1, the more the cloud of points narrows around the regression line. When the points are exactly aligned on the regression line, then R² = 1.
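For reference, the coefficient of determination is commonly written as below. The notation is our own assumption, since the text does not fix one: $y_i$ are the observed values, $\hat{y}_i$ the values predicted by the model, $\bar{y}$ the mean of the observed values, and $n$ the number of observations.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

When every point lies on the regression line, the numerator vanishes and $R^2 = 1$. When the model does no better than always predicting $\bar{y}$, the numerator equals the denominator and $R^2 = 0$.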
Mean Square Error (MSE) [30] is the arithmetic mean of the squared differences between the model's predictions and the observations. This is the value to be minimized in the context of simple or multiple regression. The method relies on the fact that the mean of the residuals is zero, whereas the mean of their squares is generally not zero.
Root Mean Square Error (RMSE) is a standard way to measure the error in model evaluation studies. It is the square root of the mean of the squared errors.
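The three metrics can be computed directly from their definitions, as in the short sketch below. The observed and predicted values are made-up numbers chosen so that the arithmetic is easy to follow by hand; they are not results from this study.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between observations and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the units of the response."""
    return math.sqrt(mse(y_true, y_pred))


def r_squared(y_true, y_pred):
    """One minus the ratio of residual to total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values: three exact predictions and one that is off by 2.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]

scores = (mse(observed, predicted),
          rmse(observed, predicted),
          r_squared(observed, predicted))
```

Here the single error of 2 gives a squared error of 4 spread over four observations, so the MSE and the RMSE are both 1, while the total sum of squares of 5 yields an R² of about 0.2.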

Implementation and Results
The present paper presents a comparison and evaluation of supervised machine learning algorithms for predicting students' academic performance. The experiments were conducted in seven major steps, one per regression method, namely ANCOVA, Logistic Regression (Logit-R), Support Vector Regression (SVR), Log-linear Regression (Log-LR), Decision Tree Regression (DTR), Random Forest Regression (RFR), and Partial Least Squares Regression (PLS-R). These regression methods were applied using the XLSTAT environment [31]. In the following, the experimental result of each algorithm is presented. The table above presents summary results for the seven algorithms used in this research work. The evaluation metrics used in this experiment are Mean Square Error (MSE), Root Mean Square Error (RMSE), and R-squared (R²). It should be noted that RMSE is simply the square root of the MSE.

Evaluation
After rigorously evaluating all seven algorithms on the 480 students of our dataset, we compare their performance to determine which model predicts best. According to the experimental results, Log-linear Regression (Log-LR) provides the best performance because it has a low MSE, a low RMSE, and a high R² score, closely followed by ANCOVA. On the other hand, we observed that Support Vector Regression (SVR) is not suitable for predicting students' academic performance because it has a high MSE, a high RMSE, and a low R² score.

Discussion
With R² = 0.73, 73% of the variability of the dependent variable, Class, is explained by the 16 explanatory variables. The remainder of the variability is due to other explanatory variables that were not considered in the present experiment.
Table 4 displays the Type III Sum of Squares analysis. This table is very important for determining whether or not the explanatory variables provide significant information. According to Fisher's F-test, the lower the F probability corresponding to a given variable, the stronger the impact of the variable on the model. In the table above, we can see the p-values for "Viewing Announcements", "Discussion Groups", "Gender", "Nationality", "Place of Birth", "Educational Stages", "Grade Levels", and "Section ID". It is clear that the p-value for "Raised Hand", "Visited Resources", "Parent Responsible", "Parent Answering Survey", and "Student Absence Days" is 0. Therefore, these parameters bring significant information to our model. Furthermore, based on the Type III sum of squares, it can be inferred that the most influential explanatory variable is "Student Absence Days".
The following chart shows the predicted values versus the observed values. In addition, confidence intervals for the mean allow for the detection of potential outliers. The following histogram represents the standardized residuals versus the performance. It indicates that the residuals grow with the performance. As we can see in Figure 8, the residuals bar chart makes it possible to quickly identify the residuals that fall outside the range [-2, 2].
Fig. 8. Standardized residuals versus the performance
In conclusion, "Raised Hand", "Visited Resources", "Parent Responsible", "Parent Answering Survey", and "Student Absence Days" allow us to explain 73% of the variability of the performance. Further analysis would be necessary because part of the information is not explained by our model.
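The residual check described above can be sketched as follows. This is a simplified version under stated assumptions: each residual is divided by the sample standard deviation of all residuals, which ignores leverage and may differ from the exact definition XLSTAT applies. The function names and the toy values are hypothetical.

```python
import statistics


def standardized_residuals(observed, predicted):
    """Residuals divided by their sample standard deviation (simplified)."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    scale = statistics.stdev(residuals)
    return [r / scale for r in residuals]


def flag_outliers(observed, predicted, limit=2.0):
    """Indices whose standardized residual falls outside [-limit, limit]."""
    z = standardized_residuals(observed, predicted)
    return [i for i, value in enumerate(z) if abs(value) > limit]


# Made-up values: nine close predictions and one large miss at index 9.
predicted = [5.0] * 10
observed = [5.1, 4.9, 5.1, 4.9, 5.1, 4.9, 5.1, 4.9, 5.0, 8.0]

outliers = flag_outliers(observed, predicted)
```

Observations flagged in this way are candidates for closer inspection rather than automatic removal, since a large residual may reflect a student whose behavior the model's features do not capture.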

Conclusion and Future Work
In recent years, predicting students' academic performance has become a main objective of educational institutions. Numerous studies demonstrate that machine learning can be an efficient technology to meet this objective. In this research work, our first aim was to compare several machine learning algorithms for predicting students' academic performance. Therefore, we applied and evaluated several algorithms, namely ANCOVA, Logit-R, SVR, Log-LR, DTR, RFR, and PLS-R. Our second aim was to determine the relationships between the features and students' academic performance. As a result of our experimental study, we can conclude that "Raised Hand", "Visited Resources", "Parent Responsible", "Parent Answering Survey", and "Student Absence Days" provide a significant amount of information for predicting students' academic performance. Certainly, this research work has some limitations. Therefore, the major directions for future work could focus on the following: firstly, applying techniques such as clustering and artificial neural networks to obtain better predictions; secondly, utilizing a dataset of massive size and diverse features to tackle the issue of