Survey of Machine Learning Techniques for Student Profile Modelling

Developments in information technology have led to the emergence of several online platforms for educational purposes, such as e-learning platforms, e-recommendation systems and e-recruitment systems. These systems exploit advances in Machine Learning to provide services tailored to the needs and profile of students. In this paper, we present a state of the art on student profile modelling using machine learning techniques over the last four years. We aim to analyse the most used and most efficient machine learning techniques in both online and face-to-face education contexts, for different objectives such as failure, dropout, orientation and academic performance, and also to analyse the dominant features used for each objective in order to obtain a global view of the student profile model. Decision Tree is the most used technique and is reported as the most efficient one in most research studies. Academic, personal identity and online behaviour features are the top characteristics used for the student profile. To strengthen the survey results, an experiment was carried out in which the machine learning techniques extracted from the state-of-the-art analysis were applied to the same datasets. Decision Tree gave the highest performance, which confirms the survey results.


Introduction
Student profile modelling relies on a profile representation that captures the main characteristics of the student and gives the most coherent, complete and operational representation of him or her. The student's characteristics include background knowledge, learning preferences, behaviours, skills, goals, etc. The student profile model can be constructed through the analysis of data from different sources such as student records, social networks, learning platforms, web forms, etc. Indeed, several works have used the student profile to propose adaptive learning, to guide students in their academic choices, or to make recommendations about their future careers. Machine Learning (ML), which aims to create knowledge automatically from data, is one of the methods used for student profile modelling. ML techniques are mainly used in classification, prediction and decision support [1]. They are usually classified into three categories: supervised, unsupervised and semi-supervised learning; with the development of artificial intelligence, other classes were introduced, such as deep, transfer and reinforcement learning. These techniques have been applied at several levels to achieve many academic objectives, such as predicting failure or dropout, orientation and academic decision making [2]-[4].
In this paper, we propose a survey of research studies on student profile modelling using machine learning techniques during the last four years (2016-2019). We focus on the categorization of student features, the machine learning techniques used and the context of each research study, in order to develop a profile model in the future and to generalize the conclusions of our previous work, which showed that decision trees are the most efficient technique on academic data [5].
This article is organized as follows. In Section II, we present the related works dealing with student profile modelling using machine learning. In Section III, we give a comparative and statistical analysis of the research studies reviewed. The last section presents a case study in which we applied different machine learning techniques to two online datasets. Finally, we give a conclusion and some perspectives.

Related Works
With the availability of data on e-learning platforms, several studies have been carried out in order to adapt training and personalize content to the learner's expectations. A. Topîrceanu et al. [6] proposed to optimize the way e-learning systems are developed using the Decision Tree (DT) technique. S. V. Kolekar et al. [7] proposed to identify the way students learn in order to customize resource delivery, using FCM clustering and NN-based classification techniques. M. Abdullah et al. [8] applied multiple classifiers, such as J48, NBTree (Naive Bayes (NB) in association with DT) and NB in association with Sequential Minimal Optimization (SMO), to study the relationship between educators and students' learning styles. R. Cerezo et al. [9] proposed a classification of students based on their behavior to predict their achievement. Other models of student performance were proposed based on Expectation-Maximization (EM) and K-means techniques [10], and on SVM and Association Rules [11].
Failure and dropping out of school are serious challenges for the educational system, which is why several studies have been devoted to these topics. Based on students' academic data, a system for the early detection of students with difficulties [12] was proposed using three classification techniques: Random Forest (thresholds and leaves), Logistic Regression (LR) and Artificial Neural Network (ANN). They all gave good accuracy, and Random Forest TH (thresholds) was the most efficient. A. U. Khasanah et al. [13] proposed a system to prevent student failure; Bayesian Network and DT techniques were implemented and compared. The most significant prediction was provided by NB. Likewise, in the context of failure prediction, other classifier models were proposed, such as SVM (RTV-SVM) [14] and ANN [15]. Several studies have been conducted to understand students' reasons for dropping out; the reported reasons include economic situation, social status, drugs and motivation. In the context of MOOC platforms, M. Khalil et al. [16] used K-means clustering to group students according to their engagement and behaviors. C. Burgos et al. [17] proposed a predictive model of student dropout from courses using a system based on LR. Another approach [18], based on big data, aimed to minimize dropout by enhancing online courses to satisfy learner objectives; the SVM, NB and K-NN techniques were compared and SVM was the most efficient. In [19], S. Kai et al. proposed a model using the J-Rip classifier and J-48 Decision Trees to predict which students would continue participating in an online college program. In traditional education, random forests were used [20] to predict students at risk of dropping out. Using data gathered from a large dataset, L. Aulck et al. [21] sought the determinant features for predicting student dropout and made recommendations to reduce it. Some authors tried to identify the relevant student attributes for predicting the student dropout rate with ID3 DT [22]. C. Márquez-Vera et al. [23] proposed a methodology and a specific classification algorithm, ICRM2, a variant of GP known as grammar-based genetic programming (GBGP), to discover comprehensible prediction models of student dropout earlier.
Recommendation systems rely on students' features to propose appropriate pedagogical content or the most relevant educational pathways. Making recommendations for students, teachers, educators and administration was the objective of the study in [24], in which the C4.5 algorithm was used to improve learning outcomes by automatically detecting students' learning styles and recommending the best aligned content. Linear Discriminant Analysis (LDA), LR and the Linear SVM (LSVM) were used to improve an e-learning platform by offering a personalized learning situation to guide student behavior, and LSVM achieved the best accuracy [25]. Based on the analysis of students' academic history, P. Dash et al. [26] proposed a decision support system that helps students select a particular subject to read; the RF algorithm reached an accuracy of 99%.
Other studies have been conducted to improve student performance. S. Bharara et al. [27] aimed to find the features that directly influence student performance using the K-means algorithm. A student performance prediction model based on interactivity with the e-learning management system was proposed using the three classifiers ANN, NB and J48, which was the most efficient using three class labels [28]. A. Mueen et al. [29] proposed a prediction model for students' academic performance based on their academic records and forum participation, using the three classifiers Multiple Layer Perceptron, DT (C4.5) and NB, which was the most powerful. The impact of using social networks on academic results was analyzed using the CART method [30]. To enhance the quality of the higher education system by evaluating student performance in courses, A. El-Halees [31] applied data mining techniques to discover knowledge based on association rules and classification (J48), and also clustered students into groups using EM clustering. The paper [32] proposed early intervention to improve module results and enhance the student experience using RF and SMO. Another work, by T. Mahboob et al. [33], helps students improve their performance by evaluating themselves on the basis of their prior records, and acts as a guide for future performance evaluations. In the context of traditional education, C. Masci et al. [34] aimed to identify the student characteristics that have a direct impact on results using regression trees. Understanding the issues and problems students encounter in their learning experience, with the goal of minimizing their educational problems, is the objective of S. Patil et al. [35], who used NB, DT (ID3) and the Memetic Algorithm (MA), which was the most efficient algorithm. In order to build an interpretable student performance prediction model, comment data mining with DT (C4.5) and RF was performed, and RF was the most efficient [36]. Studying the relationship between the cognitive admission entry requirements and the academic performance of students in their first year using NN was the objective of [37]. A. Abu [38] explored multiple factors that affect students' performance in higher education in order to predict it; four DT algorithms were implemented as well as NB, and the variables 'neighborhood' (student's residence) and 'school' were the main factors affecting student performance. A new model that enhances DT accuracy in identifying student performance was presented in [39]; four DT algorithms were applied and BFTree showed higher accuracy than the other classifiers. K. Karthikeyan et al. [40] predicted students' performance to give them a chance to improve it in the future. Their work combines two data mining techniques, clustering (Enhanced K-Means) and classification (SVM), in a system named CESVM-SPPS, which gave successful results. To identify students needing special attention from the beginning of the course, the authors of [41] used K-means clustering to group students with similar characteristics. Association rules were used to help educators understand the learning and psychological states of students in different grades, so as to formulate teaching plans and improve academic performance [42]. R. Asif et al. [43] analyzed student performance to help program directors improve the program; NB was the most efficient technique for predicting graduation performance in a four-year university program.
Some research studies focused on improvements that can be made to learning platforms. In [44], the objective is to help the administration improve the learning environment by analyzing students' opinions using BN. In [45], the authors examined the variation in students' confidence and engagement with digital technologies in learning, and considered possible implications for teachers' learning design, using association rules. S. K. Howard et al. [18] proposed a big-data-driven approach to online learning evolution that discovers students' learning patterns to guide course improvement and satisfy the learner; three machine learning techniques (SVM, NB and K-NN) were compared, and SVM was the most efficient technique in this case.

Comparative Study

Criteria
The state of the art addresses various academic problems in which providing a model of the student profile is essential to give effective solutions. Sometimes the same techniques are used in different contexts. The discovery of student patterns is often based on clustering techniques, and a set of characteristics has been identified in order to model the student profile accurately. In this comparative study, we rely on a set of criteria, described in Table 1, and we focus on: the publishing year; the objective (the subject covered by the paper); the context (the circumstances surrounding learning, which can be traditional education, e-learning, or both); the dataset source and size; the student features (each feature of each state-of-the-art paper is assigned to a category according to the categorization defined in [46]); and the ML techniques applied and their performance. For the performance value, we give only the best result, and the corresponding technique is shown in bold.
Technique: NB, NN, SVM, K-NN, DT (J48, ID3, C4.5, J-Rip, CART, etc.), MA, RF, K-means, FCM, EM, LR, AR, etc.
Performance metrics: Accuracy (A), Precision (P), Recall (R), F-Score (F) and (O) for other performance metrics (Confidence (C), Area Under Curve (AUC), Average Silhouette (S)).
iJET - Vol. 16, No. 04, 2021

Statistical analysis and discussion
In Table 2, we present a comparison of research studies dealing with student profile modeling using machine learning during the last four years. The purpose of most papers is to improve student academic performance, to understand the difficulties encountered in the learning process and to find how to enhance competencies. For other papers, identifying the parameters that influence failure or dropout was a real challenge. These studies analyze several context-dependent factors in order to predict a student's outcomes and propose solutions to educational challenges. The third most common objective is adaptive learning, which seeks to improve the quality of the learning environment and, ultimately, students' outcomes. The data sources used are of three types: questionnaires, databases (from the university or drawn online), or both. Most studies use datasets with hundreds of records, and studies that use large datasets are quite rare. In our analysis, we are interested in the objectives most addressed in the research studies, in order to understand the major concerns of the academic community. As shown in figure 1, student academic performance is the main goal of the community, followed by adaptive learning, failure and dropout; different machine learning techniques have been investigated to improve performance and to explain the reasons for failure and dropout. The survey shows that the most prevalent context is traditional education, followed by e-learning; 25% of the papers relate to both contexts.

Data quality and size are important factors in the field of machine learning. Databases from learning management systems (LMS) are generally large, whereas the size of university databases generally ranges from hundreds to thousands of records. From our analysis, we notice that most authors used databases from academic systems or online e-learning systems; questionnaires are used in only 27% of the studies, and 13% use data from both.
Figure 2 presents the distribution of the student feature categories used in the research studies reviewed. Academic information (grades, major, diploma, etc.) is the most used across contexts (75%), followed by personal identity characteristics (65%; gender, age, nationality, etc.), online behavior (45%; comments, navigation, quizzes, etc.), and social identity (37.5%; marital status, parents' education, parents' job, address, etc.). We can say that academic and personal information are unavoidable regardless of the purpose of the research study.
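The category percentages above add up to more than 100%, which is expected if each percentage is read as the share of surveyed papers that use at least one feature from that category (an assumption here, since a single paper can draw on several categories). A minimal sketch of that computation follows; the paper identifiers and category assignments are hypothetical toy values, not the surveyed studies.

```python
from collections import Counter

# Hypothetical toy inventory: paper id -> feature categories it uses.
# These assignments are illustrative only, not data from the survey.
papers = {
    "paper_a": {"academic", "personal identity"},
    "paper_b": {"academic", "online behaviour"},
    "paper_c": {"personal identity", "social identity"},
    "paper_d": {"academic"},
    "paper_e": {"academic", "personal identity", "online behaviour"},
}


def category_shares(papers):
    """Percentage of papers that use each feature category."""
    counts = Counter(cat for cats in papers.values() for cat in cats)
    return {cat: 100 * n / len(papers) for cat, n in counts.items()}


shares = category_shares(papers)
# Shares are computed per category, so their sum can exceed 100%.
for category, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {share:.1f}%")
```

Because every paper is counted once per category it uses, the shares describe coverage rather than a partition of the papers.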

Fig. 1. Student features' categories distribution
To explain the relation between the student feature categories and the context, figure 3 reveals that the distribution of the top five student characteristics depends on the context: online behaviour is the most used in the e-learning context, whereas academic performance and social identity characteristics are the most relevant for traditional education.

Experiments
From the comparative study, we notice that Decision Tree performs better than other machine learning algorithms in the educational field, but since these algorithms were applied to different datasets, it is difficult to confirm this result. Therefore, our objective in this experimental study is to apply the machine learning algorithms cited in the state of the art to two datasets from two contexts, traditional education (TE) and e-learning combined with traditional education (EL & TE), in order to confirm the results of our survey.

Background of machine learning techniques used
Decision Tree (DT) is a multistage decision-making technique able to break down a complex decision-making process into a collection of simpler decisions, providing a solution that is easier to interpret [47]. The most used implementations today are ID3, C4.5, C5.0 and CART. ID3 (Iterative Dichotomiser 3) builds the decision tree recursively: at each step of the recursion, it selects, among the attributes remaining for the current branch, the one that maximizes the information gain computed with Shannon's entropy [48]. C4.5 is an extension of ID3 that overcomes its limitations, and one of its implementations is J48. CART (Classification and Regression Trees) is a binary tree construction algorithm [49]. K-Nearest Neighbors (K-NN), also used in our comparison, is one of the oldest and simplest methods for pattern classification; it is a very intuitive method that classifies unlabelled examples on the basis of their similarity to the examples of the training dataset [50]. Support Vector Machine (SVM) is a discriminant model that attempts to minimize learning errors while maximizing the margin between classes [51]; SVM is particularly effective at handling high-dimensional data. Neural Networks (NN) were developed as a generalization of mathematical models of biological nervous systems and have shown their effectiveness in several fields [52], [53]. Naïve Bayes (NB) is a technique that estimates the probability of a new observation belonging to a predefined category [54]. Logistic Regression (LR) is a predictive technique that aims to build a model to predict or explain the values taken by a qualitative target variable (most often binary, in which case we speak of binary LR; if it has more than two modalities, we speak of polytomous LR) from a set of quantitative or qualitative explanatory variables [54].
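To make the ID3 selection step concrete, the following minimal sketch computes Shannon entropy and information gain for categorical attributes and picks the attribute with the highest gain at one node. It illustrates the criterion only, not a full tree builder; the function names, attribute names and records are hypothetical and are not taken from the datasets used in this paper.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    # Group the class labels by the value each row takes for the attribute.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    # Weighted average entropy of the partitions after the split.
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Attribute an ID3-style learner would test at this node."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy records (illustrative values only).
rows = [
    {"absences": "low", "support": "yes"},
    {"absences": "low", "support": "no"},
    {"absences": "high", "support": "yes"},
    {"absences": "high", "support": "no"},
]
labels = ["Pass", "Pass", "Fail", "Fail"]
chosen = best_attribute(rows, labels, ["absences", "support"])
print(chosen)
```

In this toy case, splitting on "absences" separates the two classes completely, so its gain equals the entropy of the full label set, while "support" leaves both partitions evenly mixed and yields no gain.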

Dataset
Two datasets covering two contexts (traditional education, and e-learning combined with traditional education) were used. The Traditional Education dataset (TE) [55] is a two-class dataset (Pass and Fail, based on the final grade) consisting of 395 student records and 33 features, collected from student achievement in secondary education at two Portuguese schools; we focused on the Mathematics subject to predict student performance. We chose this dataset because it purely reflects traditional education and because it mainly contains the same feature categories that we extracted during our analysis for this type of context, which is based on academic data, student behaviour features (such as raising hands in class, opening resources and parents answering surveys), social data, etc. The second dataset (EL & TE) is a three-class dataset in which students are classified according to their total grade (Low level, Middle level and High level). In addition to the data from the traditional context, it contains data reflecting the online behaviour of the student, namely visited online resources, discussion groups, etc. This dataset was collected from a learning management system (LMS) called Kalboard 360 [28,56] and consists of 480 student records and 16 features, including personal identity, social identity, online behaviour and learning behaviour, and in particular the important characteristics that we extracted from our analysis for this type of context.

Process and result
To conduct our case study, a well-defined process was adopted, as shown in figure 5, to detect the most efficient ML techniques in the classification of each dataset. We first performed feature selection with the information gain method to keep the important attributes, and then partitioned the data using the split method (70% for training and 30% for testing). The chosen ML techniques were applied to the two datasets to identify the most efficient technique (the one with the highest accuracy). The algorithms used are C4.5, J48, ID3, CART, NB, SVM, NN, LR and K-NN, and the evaluation is based on the accuracy metric.
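The hold-out evaluation described above can be sketched as follows: shuffle the records, keep 70% for training and 30% for testing, then report accuracy on the test part. This is a minimal illustration under stated assumptions, not the tooling used in this study, which is not specified here. The 1-nearest-neighbour classifier is a simple stand-in for the algorithms listed, and the records (hypothetical midterm grade and weekly study hours) are toy values.

```python
import random


def holdout_split(records, test_ratio=0.3, seed=0):
    """Shuffle the records and split them into (train, test) parts."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def predict_1nn(train, features):
    """Label of the training record closest to the given features."""
    def squared_distance(record):
        return sum((a - b) ** 2 for a, b in zip(record[0], features))
    return min(train, key=squared_distance)[1]


def accuracy(train, test):
    """Fraction of test records whose predicted label is correct."""
    correct = sum(predict_1nn(train, x) == y for x, y in test)
    return correct / len(test)


# Hypothetical toy records: ((midterm grade, weekly study hours), class).
records = [
    ((4, 1), "Fail"), ((5, 2), "Fail"), ((6, 1), "Fail"),
    ((5, 3), "Fail"), ((7, 2), "Fail"),
    ((14, 6), "Pass"), ((15, 8), "Pass"), ((16, 7), "Pass"),
    ((17, 9), "Pass"), ((15, 7), "Pass"),
]
train, test = holdout_split(records, test_ratio=0.3, seed=0)
print(len(train), len(test), accuracy(train, test))
```

A single hold-out split is simple, but the accuracy it reports depends on which records fall in the test part; fixing the seed keeps the comparison between algorithms reproducible.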
Table 3 shows the performance results (accuracy) of each algorithm applied to each dataset. From Table 3, we notice that for the TE dataset, the best performing techniques are those derived from the Decision Tree algorithm (C4.5, J48 and CART), while the algorithm that gave the lowest accuracy is SVM (78.81%). For the EL&TE dataset, the neural network gave the best performance; Logistic Regression and C4.5 also performed well, at around 70%. We notice that the decision tree and its variants gave good results on the traditional dataset, which contains no online data. On the other dataset, which does include this type of data, the neural network was the most efficient; according to our survey, it ranked third among the most used algorithms and fourth among the algorithms that performed best.

Conclusion
In this study, we presented a survey on student profile modeling using machine learning, in order to identify the most efficient machine learning techniques and give an overall description of the student profile used in different fields. The study shows that Decision Tree algorithms are the most used and the most efficient among all the research studies cited in this comparative study. In addition, the main student features used in profile modeling are academic information (more than 70%), followed by personal identity and online behavior, which shows that combining academic and online behavior features when modeling a student profile using machine learning has become more important. An experiment was carried out on two datasets; the results confirmed that the Decision Tree algorithm performs well on both datasets, especially in the traditional education context. In future work, we intend to propose a generic student profile model that can be exploited in many situations, such as prediction, classification, adaptive learning and e-recommendation.

Fig. 2. Student features' categories distribution based on studies contexts
Finally, figure 4 highlights the most common machine learning techniques used (a): Decision Trees and their variants are the most used technique (63%), by a very wide margin over the other techniques (NB: 35%, NN: 35%, SVM: 23%). In figure 4 (b), we notice that Decision Trees again rank first: they are the most efficient technique in 40% of the research studies, followed by NB and SVM (13% each) and neural networks (10%).

Fig. 4. Comparison methodology