Analysis of Conditions for Reliable Predictions by Moodle Machine Learning Models

In this paper the bias-variance trade-off in building and operating Moodle Machine Learning (ML) models is discussed, with the aim of avoiding the trap of unreliable predictions. Moodle is one of the world's most popular open-source Learning Management Systems (LMS), with millions of users. Although it has been possible to create ML models within the LMS since the Moodle 3.4 release, very few studies have been published so far about the conditions of their proper application. Using these models as black boxes carries a serious risk of unreliable predictions and false alarms. From a comprehensive study of differently built machine learning models carried out at the University of Dunaújváros in Hungary, one specific issue is addressed here, namely the influence of the size and the row-column ratio of the predictor matrix on the goodness of the predictions. For the so-called Time Splitting Method in Moodle Learning Analytics, the effect of varying the number of time splits and of predictors on the bias and the variance of the models has also been studied. An Applied Statistics course is used to demonstrate the consequences of the different model setups.


Introduction
According to the classical definition, Learning Analytics (LA) is "the measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimizing learning and the environments in which it occurs" [22]. Within this rather wide scope of LA, one specific objective at a university can be to optimize the teaching-learning process of a course by predicting students' performance or by reducing the risk of students not achieving the minimum grade to pass the course. As online education becomes widespread, student retention becomes a major challenge that universities must face [5], [9], [10]. Therefore, Learning Management Systems (LMS) have been continuously revised and different tools have been developed to identify students at risk of dropping out [2], [15].
Recently, Machine Learning (ML) techniques have been widely used to predict students' success (or failure) in fulfilling the course requirements and to help those students who seem to be at risk of dropping out. Multiple machine learning models are used in education [3], [8], [17], [20]. Đambić et al. built a model with 5 independent variables to identify students who need additional attention. The variables are based on the students' performance in the first period of the semester. The F1 score of their model was 77%, which is in an acceptable range, although the model was highly biased [4]. Lykourentzou et al. proposed a dropout prediction method for e-learning courses and used three different machine learning methods with time-invariant (e.g., age, gender) and time-varying (e.g., grades, attendance) predictors. The model was split into 7 time segments. The time-invariant student data were found to be less accurate predictors of a student's decision to drop out than the time-varying data, and the accuracy of the models improved significantly as the course progressed [12]. Erkan proposed a method for accurate prediction of at-risk students in an online course [6]. In his study three machine learning algorithms were used separately to classify students according to their levels of risk. The model contained only time-varying predictors in 3 time segments. He found that the number of time-varying predictors has an impact on overall model accuracy and that the accuracy of the models improved as the courses progressed.
Moseley and Mead analyzed 3978 records on 528 university nursing students, split into a training set and a validation set [16]. A machine learning algorithm with a rule induction method was developed to predict dropouts in nursing courses. The model was able to predict dropping out with 84% accuracy, and 70% of the identified students did in fact drop out. Kotsiantis et al. compared six different ML algorithms in two experiments and found the Naïve Bayes algorithm to be the most appropriate for predicting students' dropout [11]. Their study was among the first to develop a machine learning model to identify students who were likely to drop out. The attributes used in the model were students' registry data and attributes from tutors' recorded data. Tan and Shao developed and compared three prediction models based on different ML algorithms [21]. They found all algorithms to be appropriate for student dropout prediction, with the Decision Tree algorithm performing best. Ram et al. considered 21 variables (sense of belonging, scholastic context, financial status, prior academic details, and family background) related to different aspects of students' demographic data in their model [19]. They found that first-term GPA is the most important predictor of first-year dropout rates. Furthermore, metrics were defined to infer students' social integration from smart card transactions. These new features significantly improved the precision and recall rates in identifying dropout students. Chai and Gibson evaluated different models for predicting which first-year students are most at risk of leaving in their first semester of study, at the pre-enrolment, enrolment, in-semester, and end-of-semester stages [1]. A dataset of 23,291 students was analyzed using logistic regression, decision trees and random forests. The best performance was achieved with logistic regression, with 67% precision and 29% recall.
Mogus et al. analyzed Moodle LMS logs and students' feedback to measure students' effectiveness in learning [14]. In their analysis Statistica 8 and Weka software were used for data preprocessing, classification, regression, clustering, association rules and visualization. Evangelista also used Weka software in a model based on Moodle log data [7]. The predictors related to students' study behavior were course viewing time, resource views, quizzes taken, replies in discussions, and views at weekends. It was found that predictor attributes such as activities completed, course views and assignments passed are strongly correlated with students' performance. Młynarska et al. found that the timing of activities is an important factor in terms of grade [13]. Nurbiha et al. found that attracting students' attention on the very first page they visit is a very important factor in encouraging them to stay in Massive Open Online Courses [18].
In this paper the bias-variance trade-off in building and operating Moodle Machine Learning (ML) models is discussed, in order to avoid the trap of unreliable predictions. Moodle is one of the world's most popular open-source learning platforms, with millions of users, and is widely used in online education. Since the Moodle 3.4 release, ML-based analytics have been an integrated part of the system. Using this tool, scheduled analyses can be executed based on ML models. During the course, students who are lagging behind and are at risk of dropping out can be identified automatically. With this new feature the creators of Moodle provided a great tool for course builders. However, very few studies have been published so far about the conditions of its proper application [15].
Developing a Learning Management System (LMS) with integrated ML models requires the cooperation of specialists in different fields, such as course architects, course teachers, statisticians, IT specialists and LMS administrators. Later, in the operational phase, when the system is used on a daily basis to set up different ML models for different courses, usually only some course teachers and LMS administrators are involved. Frequently the ML models are used as ready-made tools by course teachers.
Even with a well-developed LMS, continual supervision of the applicability of the machine learning models is indispensable. Using these models as black boxes carries a serious risk of unreliable predictions and of false alarms (or missed alarms). Unreliable predictions may arise either from inappropriately specified models or from differences between the circumstances of the model building and the prediction phases. Here the conditions for reliable predictions by Moodle Machine Learning models are analyzed.

Machine Learning Models in Education
Predictive modelling allows us to model the relation between an educational target (also referred to as the dependent variable) and a set of predictors (also referred to as independent variables or features) related to the learners and their learning activities in a given learning context. For example, in a binary model the value of the target for a given student can be 0 if the course is passed and 1 if the course is failed. The values of the predictors for a given student are either determined by some known features of the student (e.g., gender, age, grades in former courses) or extracted and computed from activity logs stored in an LMS throughout a course. In online university courses, where students' learning activities are recorded in great detail, predictors from activity logs are widely used.
Predictive modelling can be implemented using Supervised Learning (SL) algorithms, which search for a modelling function that maps the predictors to the target. In the model building phase, the algorithms use a previous run of the course (from one of the former semesters) to determine this function; in the operational phase, in a later course, the function can then be used for prediction.
Mathematically it can be written as a hypothesis in the general form
$$Y = f(X_1, \dots, X_p) + \varepsilon,$$
where $Y$ is the target, $f$ is some unknown function of the predictors $X_1, \dots, X_p$ (usually a fixed function with unknown parameters) and $\varepsilon$ is a random error term, which is independent of the predictors and has mean zero.
The SL algorithms try to obtain the "best" estimate $\hat{f}$ of the unknown function $f$ using the existing data of the predictors and the target. The "best" $\hat{f}$ is defined as the function that minimizes the so-called cost function $Q$, which is usually the expected squared difference between the actual ($Y$) and the predicted ($\hat{Y}$) values of the target, $Q = E[(Y - \hat{Y})^2]$.
In practice we use a certain subset of the existing data, called the Training set, and obtain a certain $\hat{f}$. If we use a different Training set, we are very likely to get a different $\hat{f}$. As we keep changing Training sets, we get different outputs for $\hat{f}$. The amount by which $\hat{f}$ varies as we change the Training sets is called Variance.
To estimate the true f with different methods like linear or logistic regression we frequently use some simple function between the predictors and the target. For most real-life scenarios, however, the true relationship is more complicated. Simplifying assumptions give Bias to a model. The more erroneous the assumptions with respect to the true relationship the higher the Bias, and vice-versa. Generally, a model will have some error when tested on some test data (called Validation set in what follows). It can be shown mathematically that both Bias and Variance can only add to a model's error. We want a low error, so we need to keep both Bias and Variance at their minimum. However, that is not quite possible. There is a trade-off between Bias and Variance.
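The statement that Bias and Variance can only add to the error is commonly expressed by the standard decomposition of the expected squared prediction error at a fixed point. The block below is a sketch for the squared-error cost only; the evaluation point $x_0$ and the symbol $\sigma^2$ for the variance of the error term are notation introduced here for illustration, and the expectation is taken over Training sets and the error term.

```latex
\mathbb{E}\big[(Y - \hat{f}(x_0))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)]\big)^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

All three terms are non-negative, so neither Bias nor Variance can reduce the error; for classification losses the decomposition takes a different form, but the qualitative trade-off is the same.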
If one specific course from its starting date to its ending date is used to build the model (this time interval is used as a whole to collect the data) the predictor Xj takes one specific value for each student. For the whole set of enrolled students, it can be represented by a vector Xj with dimension of n where n is the number of students and the i-th element of this vector refers to the i-th student.
There are situations where the predictor Xj takes different values in different learning contexts for the same student. Here the different learning contexts may refer, for example, to different courses or to different time segments of the same course. In these situations the vector Xj has dimension m, which is no longer equal to the number of students n, since more than one value can be assigned to the Xj predictor for one student, one for each learning context. Now the i-th element of this vector refers to a certain student in a certain learning context. In this case additional predictors are usually given to the model to indicate the context itself. If c different contexts are given for each student, then m = nc.
A widely used version of this situation is the so-called Time Splitting method, in which the semester of one specific course is divided into c time subintervals (Time Splits) and the values of the predictors are recorded in each Split.
Here one specific Split defines the specific learning context. In this method c additional predictors S1,…,Sc, the so-called Time Split Indicators, are given to the model to indicate the Time Split in which the data are recorded. The original predictors X1,…,Xp are called Core Predictors. In this model the values of the target are repeated in each Split. Hence the set of predictor vectors forms the predictor matrix X of dimension m × r, where r = c + p. The target vector Y has dimension m. One row of the matrix X together with the corresponding value of Y is called a Sample. In Table 1 the layout of X and Y is depicted.
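The layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Time Split Indicators are encoded one-hot (the text does not specify the encoding), the function name `build_time_split_matrix` is hypothetical, and the toy values are invented rather than taken from the course data.

```python
def build_time_split_matrix(core_values, targets):
    """Stack per-split predictor values into an (m x r) matrix.

    core_values[k][i] holds the p Core Predictor values of student i in Split k.
    targets[i] is the target of student i (1 = failed, 0 = passed).
    Assumption: Time Split Indicators S1..Sc are one-hot encoded.
    """
    c = len(core_values)
    X, Y = [], []
    for k in range(c):
        indicators = [1 if s == k else 0 for s in range(c)]
        for i, y in enumerate(targets):
            X.append(indicators + list(core_values[k][i]))
            Y.append(y)  # the target is repeated in every Split
    return X, Y


# Hypothetical toy data: n = 3 students, c = 2 Time Splits, p = 2 Core Predictors
core_values = [
    [[2, 0.5], [0, 0.0], [1, 0.8]],  # values recorded in Split 1
    [[3, 0.7], [1, 0.2], [4, 0.9]],  # values recorded in Split 2
]
targets = [0, 1, 0]

X, Y = build_time_split_matrix(core_values, targets)
m, r = len(X), len(X[0])
print(m, r)  # m = n * c rows, r = c + p columns
```

With n = 3, c = 2 and p = 2 the sketch yields m = 6 Samples and r = 4 columns, in line with m = nc and r = c + p.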

The Courses and the Predictors of the Present Study
The Applied Statistics course at the University of Dunaújváros in Hungary, as delivered in the spring semester of the academic year 2019/20, was used to build the ML models. More precisely, data from both versions of the course were used, one for full-time students and the other for correspondence students. 57 full-time students and 94 correspondence students were involved in the study.
The course was delivered through the Moodle LMS and developed in a way to be able to serve even fully online students. This means that students' learning activities were pursued mainly within the LMS framework and could be recorded in detail. The subject material could be studied using different types of learning resources and activities: lecture videos, Minitab videos (videos on problem solving with statistical software), PDF lecture notes, books of solved exercises, and quizzes for self-testing.
Due to the scientific nature of the course the predictors refer mainly to cognitive activities and their values were computed from the recorded course logs like number of quiz attempts, max grades at quizzes, number of clicks to view videos or to learn from lecture notes or from books of exercises.
The learning material was partitioned into 7 chapters and altogether 57 core predictors have been defined. In the different ML models, some of them or all these predictors have been selected.
In this paper only the course for full-time students is discussed. Throughout the course the students wrote four midterm tests worth 25 points each, and the sum of the points they earned had to reach at least 70 to pass the course. This is an important difference from the correspondence course, where the students wrote only one final test to get their grades. An obvious consequence of the continuous testing of full-time students is that they are encouraged to learn steadily throughout the course, hence the matrix X is not sparse and does not contain many empty rows in the early Time Splits. In the correspondence course the situation is fundamentally different: in the early Time Splits little student activity is recorded and the predictors have zero values in many samples, so in this sense the matrix X is skewed and sparse.
In fact, 9 of the 57 students failed, hence the Y vector was rather skewed (imbalanced). The main purpose of the prediction was to correctly classify those students who are at risk of not fulfilling the minimum course requirement.
One peculiarity of the Moodle Learning Analytics used in our study must be mentioned, since it may influence the results. It automatically appends additional predictors to the Core Predictors, namely the averages of the individual predictor values in each Split, and hence doubles the number of the original Core Predictors. To differentiate these averaged predictors from the original ones, two groups of Core Predictors are distinguished in what follows: the Learning Predictors (LP, directly referring to some features of the students or to their learning activities) and the Average Predictors.

Performance Metrics
In the comprehensive study mentioned above two different SL algorithms, Logistic Regression (LR) and a two-layer feed-forward Neural Network (NN), were used to train and evaluate the models (MATLAB 2008, release 2018b). Here only the results of the LR analyses are discussed, for which 5-fold cross-validation was applied.
To quantify the models' performance, different widely accepted metrics computed from the confusion matrix were determined. In the present context the elements of the confusion matrix are the numbers of true positive (TP), true negative (TN), false positive (FP) and false negative (FN) classifications, where the positive class corresponds to failing the course. Plotting the Error rate against the Sample size for both the Training and the Validation sample sets gives the Learning Curves. From these Learning Curves not only can the actual model Bias and Variance be seen, but the way of possible model improvement can also be deduced (whether more Samples or more or fewer predictors would be better to use).
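How a Learning Curve is obtained can be illustrated with a minimal, self-contained sketch. It does not reproduce the Logistic Regression models of the study: the data are synthetic, and a simple one-dimensional nearest-centroid rule (the hypothetical helper `fit_threshold`) stands in for the classifier. Only the procedure is the point, namely training on growing Sample sizes and recording the Training and Validation Error rates.

```python
import random


def make_samples(n, rng):
    """Synthetic two-class data with one predictor (illustrative only)."""
    data = []
    for _ in range(n):
        y = 1 if rng.random() < 0.3 else 0
        x = rng.gauss(1.5 if y else 0.0, 1.0)
        data.append((x, y))
    return data


def fit_threshold(train):
    """Nearest-centroid rule in 1-D: threshold halfway between the class means."""
    xs1 = [x for x, y in train if y == 1]
    xs0 = [x for x, y in train if y == 0]
    if not xs1 or not xs0:  # only one class present: predict the majority class
        majority = 1 if len(xs1) > len(xs0) else 0
        return lambda x: majority
    m0, m1 = sum(xs0) / len(xs0), sum(xs1) / len(xs1)
    thr = (m0 + m1) / 2
    if m1 >= m0:
        return lambda x: 1 if x > thr else 0
    return lambda x: 1 if x < thr else 0


def error_rate(predict, data):
    """Fraction of misclassified samples."""
    return sum(predict(x) != y for x, y in data) / len(data)


rng = random.Random(0)
pool = make_samples(400, rng)        # Samples available for training
validation = make_samples(200, rng)  # held-out Validation set

sizes = [10, 20, 50, 100, 200, 400]
curve = []
for size in sizes:
    model = fit_threshold(pool[:size])
    curve.append((size, error_rate(model, pool[:size]), error_rate(model, validation)))

for size, err_tr, err_v in curve:
    print(f"{size:4d}  train={err_tr:.3f}  validation={err_v:.3f}")
```

A persistent gap between the two error columns would point to high Variance, while a high level of both errors would point to high Bias.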
Accuracy is defined as Accuracy = (TP + TN)/m.
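The other metrics used in the Findings can be computed from the same four counts. The sketch below uses the usual textbook definitions, with TP, TN, FP and FN denoting true positives, true negatives, false positives and false negatives, and a failed student as the positive class. The normalization nMCC = (MCC + 1)/2 is an assumption, since the definition is not given in the text, and the example counts are hypothetical, not results of the study.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Standard metrics from confusion-matrix counts (positive class = failed)."""
    m = tp + tn + fp + fn
    accuracy = (tp + tn) / m
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
        "nmcc": (mcc + 1) / 2,  # assumed normalization of MCC to [0, 1]
    }


# Hypothetical counts for 57 students, 9 of whom failed
metrics = confusion_metrics(tp=6, tn=45, fp=3, fn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike Accuracy, MCC uses all four cells of the confusion matrix, which makes it less sensitive to the skewed target.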

Findings
In Table 2 and Table 3 the main characteristics of the investigated models and the metrics related to the models' performance are summarized. The same full-time Applied Statistics course was used to build all models. In Table 2 the models contain both the Learning Predictors and the Average Predictors, whereas in the models in Table 3 only the Learning Predictors form the Core Predictors. Studying the values in these tables and constructing figures to see the tendencies better, some general findings can be summarized. For any given number of Core Predictors, the models get better in terms of their Variances as the number of Splits, and hence the row-column Ratio of the predictor matrix, increases (Figure 1). Model 10, which has only Learning Predictors and a Ratio of 10 (570 Samples, 10 Splits), has the smallest Variance. The behaviour of the Variances can be seen better in Figure 2, where the nMCC(Tr) - nMCC(V) values are plotted against the number of Splits. When the number of Splits is 1, the large number of Core Predictors causes overfitting and the model Variances are too high. As the number of Splits increases, the models with a larger number of Core Predictors usually outperform the other models, reaching lower Variances.
For the models that have only Learning Predictors, with any given number of predictors the goodness of the models in terms of their nMCC(V) values tends to increase (the Biases decrease) as the number of Splits, and hence the Ratio, increases (Figure 3). The best nMCC(V) value is reached by Model 10 with 57 Learning Predictors and 10 Splits. Usually nMCC(V) is larger if the number of Learning Predictors is larger; however, it is important to point out that overfitting also appears here at small Split (Ratio) values. In the models with Learning and Average Predictors the tendencies are similar, although the differences between the models with different numbers of predictors are less pronounced. In Figure 4 the Learning Curve of Model 10 is shown, and in Figure 5 the Learning Curve of Model 14 can be seen as an example for comparison. The closeness of the Training and the Validation curves in Figure 4 indicates low Variance, and the low level of the Validation Error rate suggests low Bias. The curves also suggest that collecting more Samples would not improve this model, since both the Training and the Validation Error rate curves have reached their imaginary horizontal asymptote. To further reduce Bias more predictors should be added, but in this case the number of Samples must also be increased to avoid overfitting.
The goodness of the model can be characterized by different performance metrics. The Validation Accuracy = 0.91 is high; however, it can be misleading. If the focus is on reliable prediction for students at risk, F1(V) = 0.71 is better to use. More interpretable metrics are Precision(V) = 0.69 and, above all, Recall(V) = 0.73. This Recall(V) value shows that 73% of all students who actually failed are classified correctly.
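Why a high Accuracy can mislead here follows from simple arithmetic on the figures already given. The sketch below computes the Accuracy of a trivial classifier that predicts "pass" for every student (a hypothetical baseline, not one of the models studied) and checks that the reported F1(V) is consistent with the reported Precision(V) and Recall(V).

```python
n_students = 57
n_failed = 9

# A classifier that always predicts "pass" is right for every passing student
# and wrong for every failing one, so its Recall for the failed class is 0.
baseline_accuracy = (n_students - n_failed) / n_students

# F1 is the harmonic mean of Precision and Recall (values as reported above).
precision_v, recall_v = 0.69, 0.73
f1_from_pr = 2 * precision_v * recall_v / (precision_v + recall_v)

print(round(baseline_accuracy, 2), round(f1_from_pr, 2))  # 0.84 0.71
```

Since each student contributes one Sample per Split with the same target, the class proportions, and hence this baseline, are the same in the time-split models. An Accuracy of 0.91 is therefore only a modest gain over 0.84, whereas the Recall rises from 0 to 0.73.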
The role of the Average Predictors in the models is rather ambiguous. The effect of their presence is illustrated in Figure 6 and Figure 7. In the models with 57 Learning Predictors and 570 Samples both the Variance and the Bias are worse when the Average Predictors are added to the Learning Predictors: the models become overfitted. In cases with a smaller number of Learning Predictors the presence of the Average Predictors may have an advantageous effect on model Bias or Variance.

Conclusion
In this paper some important aspects of building and operating educational machine learning models have been discussed. The influence of the size and the row-column ratio of the predictor matrix on the goodness of predictions has been shown. In the so-called Time Splitting method of Moodle Learning Analytics, the effect of varying numbers of time splits and predictors has been studied to see their influence on the bias and the variance of the models. As an example, the results of the analysis of an Applied Statistics course for full-time students were shown.