A Data-Driven Emotion Model for English Learners Based on Machine Learning
iJET, Vol. 16, No. 08, 2021

Abstract: Learning confusion is a common emotion among learners. With the aid of machine learning, this paper develops a data-driven emotion model that automatically recognizes learning confusion in facial expression images. The data on learning behaviors and learning confusion of multiple subjects were collected through an online English evaluation experiment, and imported to the proposed model to derive the relationship between learning confusion and academic performance, which is measured by the correctness of the students’ answers to the test questions. The experimental results show that the students with learning confusion answered a relatively low proportion of the test questions correctly. The research findings reveal the relationship between learning confusion and academic performance, laying the basis for predicting the academic performance of English learners through machine learning.


Introduction
Learning emotions are an important attribute of learners. Many researchers have tried to effectively recognize learning emotions, and intervene in negative learning emotions [1]. Some of them demonstrated that negative learning emotions can be converted into positive learning emotions, and academic performance can thus be improved, by resolving the confusion in learning in a timely manner [2,3]. Learning confusion is common among learners. A long confusion period will frustrate the learners, making it impossible for them to effectively acquire knowledge. In fact, the longer the confusion, the more negative the learning emotions, and the worse the academic performance [4].
Most relevant studies [5-7] have attempted to describe learning emotions comprehensively, but paid little attention to the influencing factors and measuring methods of such emotions. Some scholars [8,9] predicted various learning emotions, such as engagement, depression, self-confidence, boredom, and confusion, based on facial expressions and learning behaviors, and revealed the inhibitory effect of learning confusion on the learning process: continuous confusion suppresses the interest in and motivation for learning, and drags down the learners' academic performance. In online learning, a course is unlikely to be chosen or completed if its contents confuse the learners.
Intelligent teaching [10] is an effective strategy to regulate learning emotions. This strategy fully integrates artificial intelligence (AI), data analysis, and personalized recommendation to promote the problem-solving ability, and enhance the learning motivation of autonomous learners. In the field of intelligent teaching, one of the hot topics is data-driven modeling of learner knowledge, behaviors, and emotions based on the interactive data of learners. Meanwhile, the confusion recognition function in massive open online course (MOOC) and other online learning platforms can reduce learning confusion, and improve the course retention rate. The combination of data-driven modeling and confusion recognition is expected to capture and adjust the confusing course contents, remove the influence of negative learning emotions, and thus improve the academic performance of learners.
This paper designs an online English evaluation experiment to induce learning confusion with questions of different difficulties. Then, the facial expressions of the subjects were captured by a camera, and the class label of each image was defined through self-evaluation. Next, the key features were extracted from the images, and processed by machine learning and deep learning to automatically identify learning confusion. Finally, the online learning behaviors of multiple learners were collected, and imported to several common machine learning algorithms to predict their academic performance. The predictions reveal the relationship between learning confusion and academic performance.

Literature Review
Learner modeling [11] is an interdisciplinary task to construct a model of the knowledge, behaviors, and emotions of learners. To complete the task, it is necessary to integrate techniques from various fields, ranging from psychology and pedagogy to computer science. In general, learner modeling deals with one or all of the following attributes of learners: knowledge state, cognitive behaviors, and learning emotions.
The knowledge state of learners is commonly modeled by the coverage model [12], deviation model [13], and Bayesian knowledge tracing model [14]. To improve the Bayesian knowledge tracing model, Zhang and Yao [15] added implicit nodes to simulate the difference in learning ability among learners. Sun and Bin [16] noticed that the current online education model struggles to quantify the teaching process and lacks data support for curriculum design, and solved these problems by introducing knowledge tracing to online course learning; in this way, the knowledge level of students was tracked and understood in time, and the learning problems were exposed to teachers for timely adjustment of the teaching strategy. Dash et al. [17] improved the traditional knowledge tracing model, which could only follow a single knowledge point and failed to capture the difficulty of test questions; the improved model can simultaneously track multiple knowledge points, and accurately evaluate the learning ability and overall capacity of each learner.
The teaching process involves various data on learner behaviors, in addition to evaluation data. The knowledge state model alone cannot cover all the attributes of learners. Thus, many scholars have modeled the cognitive behaviors of learners. For instance, Li et al. [18] analyzed the observations, submission results, and collaborative activities on a MOOC platform, and constructed a multi-dimensional network for cognitive behaviors. Elstad et al. [19] proposed a deep learning model for learning behaviors, which could accurately predict the future learning behaviours of the students according to the data on their historical behaviors; the deep learning model not only considers the evaluation data on the students, but also takes account of the data on their online learning behaviors. To ease the loneliness and lack of cooperation among learners in distance education, Vrieling et al. [20] established a six-dimensional learner behavior model based on the learning behavior data from online teaching platforms, and identified similar and complementary learning behaviors through similarity calculation.
In recent years, growing attention has been paid to the modeling of learner emotions. As a key attribute of learners, learning emotions affect learning motivation, cognitive behaviors, and learning actions. The constructivist learning theory suggests that learning emotions determine the capability of knowledge acquisition. Marchand and Gutierrez [21] introduced learning emotions to distance education, and combined the Ortony-Clore-Collins (OCC) emotion model with a two-dimensional emotion model to illustrate the cognition-emotion interaction in distance education. Considering learning style and learning emotions, Gamalel-Din [22] created a model for e-learning students, which overcomes the emotional deficiency, a defect of the traditional online learning model, and improves the intelligence and diversity of online learning. Craig et al. [23] recognized the basic emotions in intelligent teaching systems, namely, depression, boredom, and confusion, and mentioned the heavy presence of confusion in these systems.
The data on learning emotions come from various sources, including facial expression images, physiological data, and text data. Based on facial features, Shivakumar and Vijaya [24] designed an emotion coding system that encodes facial features by the performance difference among students with different facial expressions; the system is capable of recognizing six emotions, such as happiness and surprise. Val-Calvo et al. [25] discovered that learners with stable emotions rarely change their facial expressions, making it difficult to measure their learning emotions from their emotional changes; to overcome the difficulty, these researchers recommended measuring learning emotions by monitoring physiological indices like the variations in heart rate, blood pressure, and electroencephalogram (EEG) signal. Chakraborty et al. [26] constructed an emotion model based on facial expression images and heart rate data, and successfully detected learning emotions based on a dynamic Bayesian network. With the latest data processing technology, Sun and Bin [27] effectively recognized learning emotions by analyzing multimodal data, and significantly improved the accuracy of emotion modeling.

Learning Confusion Recognition Based on Facial Expressions
In this paper, an online English evaluation experiment is designed to induce learning confusion among students. The images of their facial expressions were collected, and imported to machine learning algorithms for automatic recognition of learning confusion. The technical roadmap of our research is given in Figure 1.

Fig. 1. Technical roadmap of learning confusion recognition based on facial expressions
Note: SVM and CNN are short for support vector machine and convolutional neural network, respectively.

Data pre-processing
A total of 200 students were selected as the subjects. Half of them were male and half were female. Every subject was asked to answer 30 English test questions on the computer. The ratio of easy, medium, and difficult questions was 1:1:1. As they answered the questions, their facial expressions were captured by the camera of another computer. Five images were shot during the answering of each question. In the end, 30,000 images (size: 640×480) of facial expressions were collected from the subjects.
However, the collected data contained missing values and noise, which might undermine the effect of model training. For example, some subjects shook their heads during the experiment. Once their faces deviated from the middle of the field of view, the camera would be unable to capture the facial expressions accurately. Besides, the original images contained many feature points unrelated to facial expressions. To improve the quality of the training data, the original data were preprocessed through a series of operations, namely, data cleaning, feature extraction, and normalization.
Among the 30,000 original images, 4,920 were found to have fuzzy facial features or seriously deviated faces. After removing these invalid data, 25,080 sets of valid facial feature points were obtained. In addition, 78 feature points related to facial expressions, i.e., the key feature points, were extracted from each image. The extracted feature points belong to the key parts of the face, such as the corners of the mouth, cheeks, eyes, and eyebrows. The coordinates of the 78 feature points were flattened into a one-dimensional vector, creating a 156-dimensional feature vector for each image.
Furthermore, Z-score normalization was performed to improve the data quality, in order to reduce the variance and ensure the modelling effect. During the normalization, the mean and standard deviation of the collected data were calculated, and each feature was rescaled by subtracting the mean and dividing by the standard deviation. After the processing, the final data have a mean of 0 and a standard deviation of 1.
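The flattening and normalization described above can be illustrated with a minimal sketch. It assumes that each of the 78 key points is an (x, y) pair and that the Z-score is computed per feature across all images; the function names and the sample data are hypothetical and are not taken from the experiment.

```python
import math

def flatten_points(points):
    """Map (x, y) feature points to one flat vector [x1, y1, x2, y2, ...]."""
    return [coord for point in points for coord in point]

def z_score_columns(rows):
    """Standardize each column (feature) across all rows (samples)."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(row[d] for row in rows) / n for d in range(dims)]
    stds = [math.sqrt(sum((row[d] - means[d]) ** 2 for row in rows) / n)
            for d in range(dims)]
    # A constant feature has zero deviation; map it to 0 instead of dividing by 0.
    return [[(row[d] - means[d]) / stds[d] if stds[d] > 0 else 0.0
             for d in range(dims)] for row in rows]

# Hypothetical data: 3 images, each with 78 (x, y) points in a 640x480 frame.
images = [[(float((7 * i + 13 * k) % 640), float((5 * i + 11 * k) % 480))
           for i in range(78)] for k in range(3)]
vectors = [flatten_points(points) for points in images]
normalized = z_score_columns(vectors)
print(len(vectors[0]))  # 156 values per image (78 points x 2 coordinates)
```

After this step, every non-constant feature has zero mean and unit standard deviation over the dataset, which keeps features with large pixel ranges from dominating distance-based classifiers such as the KNN.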

Model construction
Our automatic recognition model for learning confusion was established based on five common machine learning classifiers: logistic regression (LR), decision tree (DT), SVM, k-nearest neighbours (KNN), and random forest (RF). All these classifiers are supervised learning methods. Thus, the preprocessed data need to be divided into a training set and a test set before the modelling of learning confusion.

LR
The LR is one of the simplest modelling algorithms for learning emotions. The generalized LR can predict emotions from multiple discrete or continuous variables. The predictions are binary or multiclass. Binary LR was chosen for this research, because our prediction task only involves two classes: no confusion (0) and confusion (1).
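In essence, the LR is a linear classifier based on a probability model. The likelihood of an event is measured by the odds ratio p/(1-p), where p is the probability of the positive class (confusion). The logarithm of the odds ratio, i.e., the logit function, can be defined as:
$$\mathrm{logit}(p) = \log\frac{p}{1-p}$$
The logit function receives inputs in the interval of (0, 1), and maps them to the whole range of real numbers.
Practically, this paper needs to predict the probability that a sample belongs to each class. The inverse of the logit function, also known as the sigmoid function, is commonly used:
$$\phi(z) = \frac{1}{1+e^{-z}}$$
where z is the net input, i.e., the linear combination of the weights and the sample features.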
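The relationship between the two functions can be checked numerically. The sketch below is illustrative only: the weights, bias, and feature values are hypothetical, and the 0.5 decision threshold is an assumption rather than a setting reported in the experiment.

```python
import math

def logit(p):
    """Log of the odds ratio; defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of the logit; written in a form that avoids overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_confusion(weights, bias, features, threshold=0.5):
    """Return (probability of confusion, predicted class 0 or 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, int(p >= threshold)

# Hypothetical two-feature example
p, label = predict_confusion([0.8, -0.4], 0.1, [1.0, 0.5])
print(round(p, 3), label)
```

Applying the sigmoid to logit(p) returns p, which is why the linear output of the LR can be read as the probability of the confusion class.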

SVM
The SVM classifies data by finding the separating hyperplane with the maximum margin. Following strict theoretical derivation, the SVM is a reliable high-performance classifier. The prediction effect of the SVM depends on the selection of the kernel function. Here, the Gaussian function is selected as the kernel function of the SVM.
The training samples are divided into two classes by the decision boundaries of the SVM. Any sample on the boundaries must satisfy:
$$y_i\left(w^T x_i + b\right) = 1$$
where x_i is the i-th sample; y_i ∈ {-1, +1} is its class label; w is the weight vector; b is the bias. The two hyperplanes H_1 and H_2 can be expressed as:
$$H_1: w^T x + b = 1, \qquad H_2: w^T x + b = -1$$
The margin of the decision boundaries depends on the spatial distance between H_1 and H_2, which can be calculated by:
$$d = \frac{2}{\lVert w \rVert}$$
The classifier is trained and fitted by maximizing the distance between the decision boundaries, i.e., the classification interval. A large distance tends to yield a small classification error, while a small distance brings the risk of overfitting.
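A small numerical sketch may help to make the margin and the Gaussian kernel concrete. It assumes the standard linear formulation with weight vector w and bias b, and the usual Gaussian kernel form exp(-gamma * ||x - y||^2); the values of w, b, and gamma are hypothetical, and the sketch does not train an SVM.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def margin_width(w):
    """Distance between the hyperplanes w.x + b = 1 and w.x + b = -1."""
    return 2.0 / math.sqrt(dot(w, w))

def decision_value(w, b, x):
    """Signed score; +1 or -1 means the sample lies exactly on a boundary."""
    return dot(w, x) + b

def gaussian_kernel(x, y, gamma=1.0):
    """Similarity that decays with the squared Euclidean distance."""
    squared_distance = sum((a - c) ** 2 for a, c in zip(x, y))
    return math.exp(-gamma * squared_distance)

w, b = [3.0, 4.0], -1.0            # hypothetical trained parameters
print(margin_width(w))              # 2 / ||w|| with ||w|| = 5
print(decision_value(w, b, (0.4, 0.2)))
print(gaussian_kernel((1.0, 2.0), (1.5, 2.0), gamma=0.5))
```

A shorter weight vector gives a wider margin, which is why maximizing the classification interval is equivalent to minimizing the norm of w. With the Gaussian kernel, the same reasoning applies in the implicit feature space rather than in the original 156-dimensional space.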

KNN
The KNN is a classifier based on spatial distance. The similarity between samples is measured by their spatial distance, normally Euclidean distance, and used to divide the samples into different categories. The classification process of the KNN is implemented in three steps:
• Step 1. Calculate the spatial distance between the target sample and each training sample, and sort the results in ascending order. The Euclidean distance between two samples x and y in an n-dimensional space can be calculated by:
$$d(x, y) = \sqrt{\sum_{i=1}^{n}\left(x_i - y_i\right)^2}$$
• Step 2. Select the K training samples that are the closest to the target sample, and compute the class frequency of these samples.
• Step 3. Take the class with the highest frequency among the K samples as the predicted class of the target sample, that is, allocate the target sample to that class.
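The three steps can be sketched in a few lines. The training points and K = 3 are hypothetical; the K value used in the experiment is not stated in the text.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two samples of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, target, k=3):
    # Step 1: distance from the target to every training sample, ascending order.
    distances = sorted((euclidean(x, target), label)
                       for x, label in zip(train_x, train_y))
    # Step 2: class frequency among the K nearest samples.
    nearest_labels = [label for _, label in distances[:k]]
    # Step 3: the most frequent class is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D samples: 0 = no confusion, 1 = confusion
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_x, train_y, (0.5, 0.5)))
print(knn_predict(train_x, train_y, (5.5, 5.5)))
```

Because the prediction depends directly on distances, the KNN benefits from the Z-score normalization applied during pre-processing.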

DT
The DT classifies samples based on information entropy and Gini coefficient. At present, the most popular DTs are ID3, C4.5, and Classification and Regression Trees (CART). In this paper, the CART is employed to predict learning confusion.
The DT divides the original data by information gain, and repeats the division on the child nodes iteratively until reaching the leaf nodes. The quality of the division can be improved by maximizing the information gain. This requires an objective function that maximizes the information gain in each partition:
$$IG(D_p, f) = I(D_p) - \sum_{j=1}^{m}\frac{N_j}{N_p} I(D_j) \qquad (8)$$
where f is the feature (eigenvariable) used for the division; m is the number of child nodes; D_p and D_j are the datasets of the parent node and the j-th child node, respectively; I(·) is the impurity coefficient; N_p is the number of samples in the parent node; N_j is the number of samples in the j-th child node.
Information gain refers to the difference between the impurity of the parent node and the weighted sum of the impurities of all the child nodes. As shown in formula (8), the information gain is negatively correlated with the impurity of the child nodes.
In the DT, the impurity can be measured by three criteria: entropy, Gini coefficient, and misclassification rate. Among them, entropy is a common impurity measure. For the c classes with non-zero probability at node t, entropy is defined as:
$$I_H(t) = -\sum_{i=1}^{c} p(i|t)\log_2 p(i|t)$$
where p(i|t) is the proportion of the samples at node t that belong to the i-th class. According to the calculation method of entropy, the workflow of the DT can be summarized as follows:
• Step 1. Set up a new node creation function for the DT, and record the nodes or test conditions of the tree.
• Step 2. Choose the attributes (test conditions) for dividing the training set, the most important of which is the division by the impurity coefficient.
• Step 3. Determine the class label for each node, i.e., specify which class the data belong to, and calculate the probability p(i|t) that a sample at the current node belongs to the i-th class.
• Step 4. Check whether all data belong to the same class or have the same attribute value, and prevent overfitting by pruning the DT.
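Entropy and information gain can be computed directly from the class labels at a node. The sketch below follows the standard definitions (entropy with base-2 logarithm, and information gain as the parent impurity minus the size-weighted child impurities); the label lists are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of the class labels at one node (classes with zero count are skipped)."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

def information_gain(parent, children):
    """Parent impurity minus the weighted sum of the child impurities."""
    n_parent = len(parent)
    weighted = sum(len(child) / n_parent * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [0, 0, 0, 0, 1, 1, 1, 1]             # 4 no-confusion, 4 confusion samples
perfect_split = [[0, 0, 0, 0], [1, 1, 1, 1]]   # each child node is pure
useless_split = [[0, 0, 1, 1], [0, 0, 1, 1]]   # children as mixed as the parent
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

The pure split recovers the full parent entropy as gain, while the split that leaves the class mix unchanged gains nothing, which is the negative relationship between child impurity and information gain.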

RF
RF is an ensemble method extended from the DT. Each RF encompasses multiple DTs. Compared with a single DT, the RF can achieve more accurate classification by voting. In our research, a 7-DT RF is designed to extract the voting results on randomly divided attributes, and thus classify the learning confusion. As shown in Figure 2, the RF is implemented in three steps: bootstrap sampling of the training sets, construction of a DT on each training set, and voting on the final classification result.
The deep feedforward neural network (DFNN) is one of the most widely used deep learning networks. The numerous hidden layers enable the network to fit complex classification patterns. If sigmoid serves as the activation function, vanishing gradients tend to occur during DFNN training. To avoid this phenomenon, this paper chooses the rectified linear unit (ReLU) as the activation function. Moreover, five hidden layers were designed for the DFNN. Through iterative backpropagation, the weight and bias were calculated for each layer.
Machine learning models are usually trained by the gradient descent algorithm, which iteratively approaches a local optimal solution. Here, each DFNN is represented as a directed acyclic graph. For the purpose of deep learning, the gradient was calculated from the output layer back to the input layer through backpropagation by the chain rule, thereby determining the weight and bias between hidden layers.
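The choice of ReLU over sigmoid can be illustrated with the chain rule. The sketch below multiplies only the activation derivatives of five hidden layers and ignores the weight terms, which is a simplifying assumption; the pre-activation values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid; never larger than 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU; 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def chained_gradient(grad_fn, pre_activations):
    """Product of activation derivatives along the backpropagation path."""
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product

# Hypothetical pre-activation values, one per hidden layer (five layers)
pre_activations = [0.5, 1.0, 2.0, 0.8, 1.5]
print(chained_gradient(sigmoid_grad, pre_activations))
print(chained_gradient(relu_grad, pre_activations))
```

With sigmoid, each layer scales the gradient by at most 0.25, so the factor reaching the first hidden layer is at most 0.25^5 (below 0.001). With ReLU, active units pass the gradient through unchanged.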

Result and Analysis
The classifier performance is generally measured by accuracy, precision, recall, and F-score. The four metrics can be calculated from four counts: true positives (TR), false negatives (FN), false positives (FR), and true negatives (TN) (Table 1). The TR refers to the number of correctly classified positive samples; the FN refers to the number of positive samples incorrectly classified as negative; the FR refers to the number of negative samples incorrectly classified as positive; the TN refers to the number of correctly classified negative samples. Table 2 compares the metrics of the six classifiers on the classification of learning confusion.
As shown in Table 2, the classification accuracy of the DFNN was not significantly higher than that of the traditional machine learning algorithms. The RF achieved the highest accuracy of 71.39%, while the LR ended up with the lowest accuracy of 58.79%. In addition, the six classifiers performed better on confusion samples than on no-confusion samples.
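The four metrics follow from the four counts by their standard definitions. In the sketch below, the parameter names mirror the abbreviations used in the text, the F-score is taken as the F1-score (an assumption, since the weighting is not stated), and the counts are hypothetical rather than values from Table 2.

```python
def classification_metrics(tr, fn, fr, tn):
    """Accuracy, precision, recall, and F1-score from the four counts.

    tr: positive samples classified as positive
    fn: positive samples classified as negative
    fr: negative samples classified as positive
    tn: negative samples classified as negative
    """
    total = tr + fn + fr + tn
    accuracy = (tr + tn) / total
    precision = tr / (tr + fr) if tr + fr else 0.0
    recall = tr / (tr + fn) if tr + fn else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}

# Hypothetical counts for 100 test images
metrics = classification_metrics(tr=40, fn=10, fr=20, tn=30)
for name, value in metrics.items():
    print(name, round(value, 4))
```

Reporting precision and recall per class, in addition to accuracy, shows whether a classifier handles confusion samples and no-confusion samples equally well.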
To clarify the relationship between learning confusion and academic performance, 50 college students were invited to participate in the online English evaluation experiment. Besides answering each question, the subjects were asked to judge whether they were confused by the question. Table 3 presents the behaviour data on all the 50 students. Statistical analysis on the 1,000 questions shows that 245 questions were confusing, while 755 were not. Among the students confused by the questions, 183 (70%) checked the analysis on these questions. That is, most students solved their confusion by looking at the question analysis.
Through feature selection, two variables, namely, learning confusion and checking analysis, were used to predict the correctness of the students' answers, and evaluate their academic performance. The academic performance was predicted separately by LR, SVM, RF, and DT. Tenfold cross validation was adopted to compare their prediction results. As shown in Table 4, the LR achieved higher prediction accuracy than the other three methods, followed by the RF, while the SVM had the worst prediction effect on academic performance. The RF surpassed 60% in prediction accuracy, higher than that of the DT, thanks to the integration of multiple DTs.
Besides, it is more accurate to predict the academic performance by relying solely on learning confusion than by relying on both learning confusion and checking analysis. For the LR, the prediction accuracy was 62.95% when only learning confusion was considered; the accuracy dropped to 56.71% when both learning confusion and checking analysis were considered.
Figure 3 displays the relationship between the number of analysis checks, the number of confusing questions, and the total score. It can be seen that the number of analysis checks was directly related to the number of confusing questions. This is because the students tend to ask for help by pressing the "analysis" button when they are confused by the current question.
The academic performance is greatly affected by learning behaviours, which indirectly determine the emotional state of the learner. If a learner has a negative emotion, his/her learning behaviours will negatively affect his/her academic performance. Judging by the relationship between the number of confusing questions and the number of analysis checks, learning confusion has a negative correlation with academic performance.
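Tenfold cross validation splits the samples into ten folds, and uses each fold once as the test set while training on the other nine. The sketch below only generates the index splits; the shuffling seed and the helper name are assumptions, and the sample count of 1,000 corresponds to the answered questions mentioned above.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(1000, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

The accuracy of each method would then be averaged over the ten test folds, so that every sample is used for testing exactly once.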

Conclusion
This paper designs an online English evaluation experiment to induce learning confusion among students. The facial expressions of the students were shot, and imported to machine learning algorithms to realize automatic recognition of learning confusion. The parameters of the recognition model were optimized and evaluated, indicating that RF is the best classifier for this task, with a classification accuracy of more than 70%. Next, the data on learner behaviours were collected through a series of online English evaluation experiments, and fully analysed to disclose the relationship between learning confusion and academic performance. In addition, a machine learning prediction model was designed for academic performance, and used to clarify the relationship between learning confusion and the correctness of the students' answers to the test questions.

Steps of the RF implementation (Figure 2):
• Step 1. Extract samples with replacement from the original dataset by bootstrap sampling, and divide them into N training sets.
• Step 2. Construct a DT for each training set, and collect the classification results of the N DTs for each sample.
• Step 3. Determine the final classification result by voting, i.e., allocate each sample to the class that receives the most votes.
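The bootstrap-and-vote procedure can be sketched as follows. For brevity, each tree is a one-split stump on one randomly chosen feature, which is a stand-in for the full CART used in this paper; the data, the seed, and all function names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(xs, ys, rng):
    """Step 1: draw len(xs) samples with replacement."""
    idx = [rng.randrange(len(xs)) for _ in xs]
    return [xs[i] for i in idx], [ys[i] for i in idx]

def train_stump(xs, ys, rng):
    """Step 2 (simplified): one threshold split on one random feature."""
    feature = rng.randrange(len(xs[0]))
    best = None
    for threshold in sorted({x[feature] for x in xs}):
        left = [y for x, y in zip(xs, ys) if x[feature] <= threshold]
        right = [y for x, y in zip(xs, ys) if x[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x[feature] <= threshold else right_label

def train_forest(xs, ys, n_trees=7, seed=0):
    rng = random.Random(seed)
    return [train_stump(*bootstrap_sample(xs, ys, rng), rng)
            for _ in range(n_trees)]

def forest_predict(forest, x):
    """Step 3: the class with the most votes wins."""
    return Counter(tree(x) for tree in forest).most_common(1)[0][0]

# Hypothetical, well-separated data: 0 = no confusion, 1 = confusion
xs = [(i / 100, i / 100) for i in range(10)] + [(0.9 + i / 100, 0.9 + i / 100) for i in range(10)]
ys = [0] * 10 + [1] * 10
forest = train_forest(xs, ys, n_trees=7)
print(forest_predict(forest, (-1.0, -1.0)), forest_predict(forest, (2.0, 2.0)))
```

An odd number of trees, such as the seven used here, avoids tied votes in a two-class task.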

Fig. 3. Relationship between the number of checking analysis, number of confusing questions, and total score

Table 1. Metrics of classifier performance
Note: The FR is the number of negative samples incorrectly classified as positive; the TN is the number of correctly classified negative samples. The metrics of the six classifiers on the classification of learning confusion are compared in Table 2.

Table 2. Classification performance of six classifiers

Table 3. Behavior data on the 50 students

Table 4. Prediction accuracy of different methods