A Framework of an Intelligent Education System for Higher Education Based on Deep Learning

Intelligent learning platforms and education information application platforms are gaining ground, owing to the wide application of modern technologies such as the Internet of Things, big data analysis, artificial intelligence, and cloud computing. However, the current platforms cannot solve specific teaching problems, and the relevant research mostly focuses on primary and secondary education. Therefore, this paper constructs and analyzes a framework for an intelligent education system for higher education based on deep learning. Firstly, the functional block diagram of the system was built. Next, a face detection algorithm was proposed based on the multi-task convolutional neural network, a face recognition algorithm was developed based on the improved deep convolutional neural network, and the knowledge learning status of students was tracked based on the memory augmented neural network. Finally, experiments showed the proposed framework to be effective and fast. The research results expand the application scope of deep learning in education.


Introduction
With the rapid development of Internet technology and the popularization of smart terminals, digital and information-based educational methods are constantly improving. Modern technologies such as the Internet of Things, big data analysis, artificial intelligence and cloud computing are now widely applied, leaving even more room for the development of intelligent learning platforms and education information application platforms [1][2][3]. Higher education is the specialized and vocational education that people receive after completing their secondary education. In recent years, it has developed at an unprecedented speed and gradually transformed from elite education to popular education [4]. Intelligent education systems are also needed in various parts of higher education, as they can make teaching more convenient for teachers and offer students a new way of learning.
Currently, the research on intelligent education mainly focuses on the realization of such functions as interactive learning and homework review, or the exploration of teaching modes and teaching service strategies [5][6][7]. Huang et al. [8] first adopted the literature analysis method to analyze the application and research status of smart learning platforms in higher education and vocational education both at home and abroad, and then proposed a "four-stage and nine-part" smart education and teaching application model. They set up an experimental class and a control class to compare the effectiveness of platform applications, and verified that the proposed teaching model could have positive effects on students' learning efficiency and academic performance. Modern intelligent educational robots integrate electronics, computer, sensing and other technologies. As open teaching tools, they have been applied in various discipline competitions and special teaching scenarios [9][10][11][12][13]. Based on the functional requirements, Crow et al. [14] gave the design processes of educational robot software and hardware for classroom teaching in primary and secondary schools. The designed system supports free configuration of port resources and has universal and scalable interfaces. Based on an analysis of the functional requirements of regional intelligent education, Yong [15] gave the detailed design of the supporting intelligent education information platform, including its topological structure, network structure, functional modules and related teaching resource database, and then carried out system function and performance tests. Afanasyev et al. [16] aimed to develop a digital teaching system for medical morphology. Based on market demand research, they presented the overall design plan and detailed development process of the system, including its strategic, scope, structural and visual aspects, and summarized the design based on the performance test. The data collected by intelligent education platforms are of various types and forms and thus can only be efficiently organized and processed with technical support. Babic et al. [17] studied the automatic collection, fusion and verification methods for multisource heterogeneous data adopted by an intelligent education platform, constructed a dynamic learning model for the platform and a knowledge graph for different disciplines, realized the prioritization of knowledge points, and established an accurate exercise recommendation algorithm.
A review of the existing research shows that scholars at home and abroad seldom consider the particularity and functional advantages of the intelligent education platform system framework. They have not built any application platform that can solve specific teaching problems, nor have they developed corresponding application models [18][19][20][21][22]. In addition, the existing research mostly focuses on primary and secondary education, and little has been done on higher education and vocational education. To this end, this paper constructs an intelligent education system framework for higher education based on the deep learning algorithm, which is widely applied in the field of computer vision, and also conducts related research. The paper consists of the following parts: firstly, it presents the functional block diagram of the deep-learning-based intelligent education system for higher education, and describes in detail the processes of class attendance, class status monitoring and knowledge status monitoring; secondly, it introduces the algorithm principles of the face detection module based on the multi-task convolutional neural network and the face recognition module based on the improved deep convolutional neural network; thirdly, it introduces the basic algorithm principle of the knowledge learning status tracking module based on the memory augmented neural network. The experiment proves the effectiveness and rapidity of the constructed model.

The core purpose of the intelligent education system for higher education proposed in this paper is to bring intelligent education management to colleges and universities, that is, to achieve class attendance management, class status monitoring and knowledge status monitoring with the aid of the face detection, pose recognition, face recognition and knowledge status tracking technologies, to perform statistical analysis of students, classes and teachers based on the class attendance, class status and knowledge status data output by the modules, and to help teachers and other teaching staff control classroom teaching quality. Figure 1 shows the functional block diagram of the deep-learning-based intelligent education system for higher education. The system functions mainly include class attendance, class status monitoring, knowledge status monitoring and learning report analysis.
The real-time surveillance video of the class is transmitted to the cloud platform through the Real-Time Streaming Protocol (RTSP) for video frame processing. Figure 2 shows a flowchart of class attendance and class status monitoring. The server performs student face detection, pose recognition and identity recognition on the framed pictures and uploads the results to the database. It also analyzes the knowledge status based on the log data of the interactions between students and the intelligent education system, such as learning and exercise data, and completes the performance statistics and curve plotting of participants in daily teaching.

Face detection module
In this paper, the multi-task convolutional neural network (MTCNN), which inherits the cascading idea of the Viola-Jones (V-J) classifier, is applied to the student face detection task at the class attendance and class status analysis stages of the intelligent education system. With its three cascaded networks, the P-net, R-net, and O-net, the MTCNN lets the system rapidly eliminate non-face areas in the early stage of face detection.
The MTCNN has three outputs and therefore three training loss functions. Suppose the input sample is $a_p$ and the probability that the neural network identifies it as a student's face is $\gamma_p$; then the binary cross-entropy loss function can be expressed as Formula (1), where $l_p^{TC}$ indicates whether the input sample image is a student's facial image, and $0 \le l_p^{TC} \le 1$. Formula (2) is the regression loss function for face candidate regions, where $l'^{CA}_p$ and $l_p^{CA}$ are the predicted position of the face candidate region output by the network and the marked actual position of the student's face region, respectively. It can also be transformed into the Euclidean distance loss shown in Formula (3).

Suppose the weighting coefficients of the three training loss functions are represented by $\varphi$, and the binary function $\delta$ characterizes whether the training sample is positive or negative: it is 1 for a positive sample and 0 for a negative sample, which means that the regression loss shown in Formula (2) is calculated only when $\delta = 1$. Then the overall loss function of MTCNN can be expressed as Formula (4). The weighting coefficients of the P-net, R-net, and O-net are different. The O-net needs a larger $\varphi$ to enhance its ability to detect key points, while the other networks can use a small $\varphi$.
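As a minimal sketch of how the three losses combine, the following Python follows the standard MTCNN formulation rather than the exact Formulas (1)-(4), which may differ in detail. The function names, the separate key-point term and the example weights in `phi` are assumptions made for illustration.

```python
import math


def face_cls_loss(gamma_p, l_tc, eps=1e-12):
    """Binary cross-entropy between the predicted face probability and the label."""
    gamma_p = min(max(gamma_p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(l_tc * math.log(gamma_p) + (1.0 - l_tc) * math.log(1.0 - gamma_p))


def squared_l2(pred, target):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def mtcnn_sample_loss(gamma_p, l_tc, box_pred, box_true, pts_pred, pts_true,
                      phi=(1.0, 0.5, 0.5), delta=1):
    """Weighted sum of the three task losses for one training sample.

    phi   -- assumed weights for (classification, box, key points)
    delta -- 1 for a positive sample, 0 for a negative one; the regression
             terms only contribute when delta == 1
    """
    cls = face_cls_loss(gamma_p, l_tc)
    box = squared_l2(box_pred, box_true)
    pts = squared_l2(pts_pred, pts_true)
    return phi[0] * cls + delta * (phi[1] * box + phi[2] * pts)
```

Under this sketch, giving the O-net a larger key-point weight simply means passing a larger third entry in `phi` when training that network.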
To detect students' faces, MTCNN requires the input images to be of a fixed size, so the images to be input need to undergo pyramid processing to reduce their size. In this way, the speed of the algorithm can be effectively improved: when the size is reduced from 640×480 to 320×240, MTCNN can run 3-4 times faster. After the student's facial image pyramid is generated, the image data with a size of 320×240 is normalized according to Formula (5). The full convolutional network (FCN) in MTCNN receives the image data whose pixel values are within the range (-1, +1), and then uses a sliding window to traverse all the pixels in the students' facial image pyramids.

Figure 4 shows the structure of the R-net. The input of the R-net is the images output by the P-net that have undergone the non-maximum suppression operation and been scaled to a fixed size. The R-net, which uses a fully connected layer for output, can deliver more accurate results regarding students' face regions and related probabilities. Figure 5 shows the structure of the O-net, which receives the images output by the R-net after they have undergone the non-maximum suppression operation and been scaled to a fixed size. Like the R-net, the O-net is used to further correct the predictions about students' facial regions and related probabilities.
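The pyramid and normalization steps described above can be sketched as follows. This is an illustration under assumptions: the 12-pixel window, the 20-pixel minimum face size, the 0.709 scale factor and the (x - 127.5) / 128 normalization are common MTCNN defaults, not values taken from Formula (5).

```python
def pyramid_scales(width, height, min_face=20, window=12, factor=0.709):
    """Return the scale factors of an image pyramid.

    Scaling stops once the shorter image side would fall below the
    sliding-window size of the first network.
    """
    scale = window / min_face              # map the smallest face onto the window
    shorter = min(width, height) * scale   # shorter side at the current scale
    scales = []
    while shorter >= window:
        scales.append(scale)
        scale *= factor
        shorter *= factor
    return scales


def normalize_pixel(value):
    """Map an 8-bit pixel value (0..255) into the open interval (-1, 1)."""
    return (value - 127.5) / 128.0


scales = pyramid_scales(320, 240)
```

Each entry of `scales` corresponds to one level of the pyramid; the input image would be resized by that factor and normalized before being passed to the first network.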
In the face detection module, all three networks, the P-net, R-net and O-net, output $l_p^{CA}$, $l_p^{ED}$ and $FP$. In actual applications, only the $l_p^{ED}$ output from the O-net is retained, and the outputs from the P-net and the R-net are discarded. Finally, the face pose of the student is analyzed based on $l_p^{ED}$, which includes the right and left eye corners $l_p^{REC}$ and $l_p^{LEC}$, the nose $l_p^{N}$, and the right and left mouth corners $l_p^{RMC}$ and $l_p^{LMC}$. It is assumed that, when a student is directly facing the camera, the distances from the left and right corners of the eyes and the left and right corners of the mouth to the nose satisfy Formula (6). The tilt factor $\eta_T$ of a student's face pose is defined by Formula (7). When $\eta_T \ge 0.5$, it can be determined that the student's face tilts too much; such an image can be used as the basis for evaluating the student's performance in class, but not as the basis for identity recognition. When $\eta_T < 0.5$, the student's facial image processed by the face detection module can be input into the face recognition module for further processing.
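To illustrate how a tilt factor computed from the five key points can gate the recognition step, the sketch below uses a hypothetical asymmetry measure. It is a stand-in, not Formula (6) or Formula (7): the `tilt_factor` definition and all function names are assumptions, and only the 0.5 threshold comes from the text.

```python
import math


def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def asymmetry(a, b):
    """Relative difference of two distances: 0 when equal, approaching 1 when very unequal."""
    total = a + b
    return abs(a - b) / total if total > 0 else 0.0


def tilt_factor(rec, lec, nose, rmc, lmc):
    """Hypothetical tilt factor in [0, 1] (NOT the paper's Formula (7)).

    Averages the left/right asymmetry of the eye-corner-to-nose and
    mouth-corner-to-nose distances; a frontal face gives a value near 0.
    """
    eye_term = asymmetry(dist(rec, nose), dist(lec, nose))
    mouth_term = asymmetry(dist(rmc, nose), dist(lmc, nose))
    return 0.5 * (eye_term + mouth_term)


def usable_for_recognition(eta_t, threshold=0.5):
    """Only faces below the tilt threshold are passed on to face recognition."""
    return eta_t < threshold
```

A perfectly symmetric set of key points yields a tilt factor of 0, so the image would be passed on to the face recognition module.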

Face recognition module
In the practical application of an intelligent education system, the images of students in class cannot be regularly collected, so it is impossible to perform neural network training just using students' image samples. However, using feature vectors to describe students' facial information can address the problem of zero samples in actual scenarios. The deep convolutional neural network can extract the salient features of the input facial images, and then analyze whether the two images are from the same person by calculating the distance between the feature vectors. Figure 6 shows the extraction process.

Fig. 6. Extraction process of students' facial feature vectors in the face recognition module
The feature vector extracted by the deep convolutional neural network can be expressed as $FV = G(x)$. The Euclidean distance between the feature vectors of two images can be calculated by Formula (8), or expressed as the cosine distance shown in Formula (9). The distance between the vectors is compared with the preset threshold $DisT$. If it is smaller than $DisT$, the two images $a_1$ and $a_2$ can be deemed to be from the same person.
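The comparison step can be sketched in a few lines of Python. The function names are illustrative, the cosine distance is taken here as one minus the cosine similarity (an assumption about Formula (9)), and any threshold value passed as `dis_t` is hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def same_person(fv1, fv2, dis_t, metric=euclidean_distance):
    """Two images are deemed to show the same person if their distance is below dis_t."""
    return metric(fv1, fv2) < dis_t
```

In practice the threshold would be tuned on a validation set, since the two metrics have different ranges.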

Fig. 7. Structure of the residual module
The deep convolutional neural network used for face recognition often cannot be loaded and run on the existing hardware platforms due to its relatively complex structure, large number of parameters and high computing burden. Therefore, the residual module, the compression and excitation module, and the maximum feature map pooling module were introduced into the classic convolutional neural network in this paper to make the network's structural units lightweight. Figure 7 shows the structure of the residual module. With its jumper (skip) connection structure, the residual module adds the student's facial image $a$ processed by the face detection module to $a''$, the result of passing $a$ through two identical convolutional layers, and outputs the sum. Through the jumper connection, the residual module can quickly back-propagate the gradient of the network loss function, effectively mitigating vanishing and exploding gradients.
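The jumper connection can be illustrated with a one-dimensional toy example. This is only a sketch: real residual modules operate on 2-D feature maps, and the ReLU between the two convolutions, the zero padding and the odd kernel length are assumptions not stated in the text.

```python
def conv1d_same(signal, kernel):
    """1-D convolution with zero padding so the output length matches the input.

    Assumes an odd kernel length.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def residual_block(a, kernel1, kernel2):
    """Return a + a'', where a'' is a passed through two convolutional layers."""
    a2 = conv1d_same(relu(conv1d_same(a, kernel1)), kernel2)
    return [x + y for x, y in zip(a, a2)]
```

With all-zero kernels the block reduces to the identity, which is why the gradient can still flow through the jumper connection even when the convolutional path contributes little.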

Fig. 8. Structure of the compression and excitation module
Figure 8 shows the structure of the compression and excitation module. The output of the residual module is the input of the compression and excitation module. The module first performs global average pooling on the W×H images of the students in class in each of the X feature map channels, obtaining X compressed 1×1 feature images that characterize the inter-channel information of the input images. The compressed feature images are passed through two fully connected layers and processed by the excitation function, the sigmoid function, to obtain X probability values within the (0, 1) interval, which are then multiplied by the X input images. After that, the results are output. In this way, the compression and excitation module can highlight the feature information of the images by letting the network learn to enhance or suppress individual channels.
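The three steps (pooling, two fully connected layers with a sigmoid, channel-wise rescaling) can be sketched in plain Python. The ReLU between the two fully connected layers and the weight layout (one row per output unit) are assumptions borrowed from the usual squeeze-and-excitation design; all names are illustrative.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(vector, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def squeeze_excite(feature_maps, w1, b1, w2, b2):
    """Apply compression and excitation to X channels of H x W nested lists."""
    # Compression: global average pooling gives one value per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_maps]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one gate in (0, 1) per channel.
    hidden = [max(0.0, h) for h in dense(squeezed, w1, b1)]
    gates = [sigmoid(g) for g in dense(hidden, w2, b2)]
    # Rescale every pixel of each channel by that channel's gate.
    scaled = [[[g * px for px in row] for row in ch]
              for ch, g in zip(feature_maps, gates)]
    return scaled, gates
```

Channels whose gate is close to 1 pass through almost unchanged, while channels with a gate close to 0 are suppressed.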
The main function of the maximum feature map pooling module in the network is to quickly reduce the dimensions of the X images output from the compression and excitation module. Figure 9 shows the computation process of the module. First, the X images are divided into X/2 groups; the two images in each group are compared pixel by pixel, and the larger pixel values are output to form a new image. In other words, the maximum feature map pooling module halves the number of feature map channels. A deep convolutional neural network with a large number of feature map channels can keep the number of channels within 400 by using multiple maximum feature map pooling modules. This can effectively reduce the computing burden and resource occupation rate of the network.
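A minimal sketch of this channel-halving operation is given below. How the channels are grouped is not specified in the text; pairing channel k with channel k + X/2 is an assumption made here.

```python
def max_feature_map(feature_maps):
    """Halve the number of channels by an element-wise maximum over channel pairs.

    feature_maps is a list of X channels, each an H x W nested list.
    Channel k is paired with channel k + X/2 (an assumed grouping).
    """
    x = len(feature_maps)
    if x % 2 != 0:
        raise ValueError("number of channels must be even")
    half = x // 2
    return [[[max(p, q) for p, q in zip(row_a, row_b)]
             for row_a, row_b in zip(feature_maps[k], feature_maps[k + half])]
            for k in range(half)]
```

Applying the function repeatedly halves the channel count each time, which is how several such modules can keep the channel count bounded.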

Implementation of the Knowledge Learning Status Tracking Function Based on Deep Learning
The intelligent education system needs to scientifically and effectively track students' knowledge learning status, so as to customize learning paths for them based on their weak spots in knowledge. The ideal knowledge status tracking model needs to be able to dynamically track how fast a student absorbs knowledge, and to process the repetition rate of exercises and the recommended order of exercises. Figure 10 shows the proposed knowledge status tracking model.

$M_t$ is calculated by the weighting module, in which the weight value $\omega_t$ must be properly set, as it characterizes how well a student has mastered the knowledge point and affects the accurate evaluation of the probability of correct answers. As shown in Figure 10, to obtain ideal weight values, the one-hot encoding of each knowledge point $KP_t$ is first processed by the embedding layer, which maps discrete variables to continuous vectors. Then the dot product of the resulting continuous vector $KPC_t$ and the external memory matrix module at the previous moment, $LS_{t-1}^{KPC} = (LS_1^{KPC}, LS_2^{KPC}, \ldots, LS_n^{KPC})$, is calculated. Finally, the result is processed by the Softmax function shown in Formula (11).

The calculation of the probability of correct answers to exercises is a prediction of the student's knowledge learning status. Specifically, for the knowledge point $KP_t$, each memory unit of the external memory matrix module is weighted by $\omega_t$, as in Formula (12). The calculated $P_t$, which represents the student's overall learning status across multiple knowledge points, is concatenated with $KPC_t$ and input into a fully connected layer, and then the excitation function ReLU is used to characterize the difficulty of the knowledge point and how well the student has mastered it. At last, the excitation function Sigmoid is used to predict the probability of correct answers to exercises.

In order to update students' absorption of knowledge, the external memory matrix module needs to be updated. Let the vector $D_t$ represent the gain a student receives after finishing learning the knowledge point and answering the questions correctly. This paper innovatively uses $D_t$ in place of $KP_t$ and $M_t$ to update the module. As shown in Figure 10, $\omega_t(k) D_t$ is input into the memory units in the $k$-th row of $LS_t$, each of which has the same LSTM function. The updated external memory matrix module can be expressed by Formula (15), where $LS_t(k)$ is the $k$-th row of $LS_t$. The loss function of the model can be calculated by Formula (16).
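The weighting and prediction steps can be sketched as follows, in the style of a key-value memory network. This is an illustration under assumptions: the split into `key_memory` and `value_memory`, the layer shapes and all function names are hypothetical, and the LSTM-based memory update is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def correlation_weights(kpc_t, key_memory):
    """omega_t: softmax over the dot products of KPC_t with each memory row."""
    return softmax([sum(a * b for a, b in zip(kpc_t, row)) for row in key_memory])


def read_state(omega_t, value_memory):
    """P_t: sum of the memory rows, each weighted by omega_t(k)."""
    dim = len(value_memory[0])
    return [sum(w * row[d] for w, row in zip(omega_t, value_memory))
            for d in range(dim)]


def predict_correct(p_t, kpc_t, w_hidden, b_hidden, w_out, b_out):
    """Concatenate P_t and KPC_t, apply FC + ReLU, then FC + sigmoid."""
    joined = list(p_t) + list(kpc_t)
    hidden = [max(0.0, sum(w * v for w, v in zip(row, joined)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))
```

A typical call sequence would compute `correlation_weights`, pass the result to `read_state`, and then feed the read vector together with the embedded knowledge point into `predict_correct`.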

Experimental Results and Analysis
Since the intelligent education system for higher education is applied in college classrooms, the system must maintain its face monitoring accuracy under various interference factors, such as different lighting, students lowering their heads, or students blocking each other. The experimental data set for the face detection module therefore consisted of videos randomly shot by surveillance cameras in 40 ordinary college and university classrooms, with the video of each classroom lasting for 60 seconds. After video framing, 15,000 photos were randomly selected to form the training image set and test image set, of which 12,000 were for training and the rest for testing. Figure 11 shows the changes in the loss curve of the face detection module. It can be seen that the loss function value of the network decreased rapidly before the number of iterations reached 5,000, decreased more slowly after about 5,000 iterations, and tended to converge after about 7,500 iterations.

iJET, Vol. 16, No. 07, 2021
The training, validation and test image sets used by the face recognition module consisted of about 26,000 facial images obtained by cropping the students' facial images detected by the face detection module, of which 18,000 formed the training set, while the other 8,000 formed the validation and test sets. Figure 12 shows the changes in the loss curves of the face recognition module before and after the face detection module was applied. It can be seen that the network training loss of the face recognition module based on face detection was lower than that without face detection, and that the curve showed a much more obvious downward trend. Table 1 shows the test results after successful face model training. Figure 13 shows the changes in the recognition accuracy of the face recognition module before and after the face detection module was applied. The test results show that the accuracy of the model for face recognition reached 96-97%, and that the running time of the model was no more than 3 s, proving the effectiveness and rapidity of the model.

Conclusion
This paper constructed an intelligent education system framework for higher education based on the deep learning algorithm and carried out related research. It first built the functional block diagram of the deep-learning-based intelligent education system for higher education, which consists of four major functions: class attendance, class status monitoring, knowledge status monitoring and learning report analysis. Then it introduced in detail the face detection algorithm based on the multi-task convolutional neural network and the face recognition algorithm based on the improved deep convolutional neural network. The experimental results show that the accuracy of the model in face recognition reached 96-97%, and the running time of the model was no more than 3 s, which proved the effectiveness and rapidity of the model. The last part of this paper realized tracking of students' knowledge learning status based on the memory augmented neural network. Through an experiment, this paper verified that the constructed model has better anti-overfitting performance than other models.

Fig. 1. Functional block diagram of the deep-learning-based intelligent education system for higher education

Fig. 2. Flow chart of class attendance and class status monitoring

Fig. 3. Structure of the P-net

Fig. 4. Structure of the R-net

Fig. 5. Structure of the O-net

Images are first screened by the P-net, whose structure is shown in Figure 3. Based on the position of the student's face $l_p^{CA}$, the position of the key points $l_p^{ED}$, and the probability $FP$ that a student's face exists in the image, all output by the P-net, the face candidate regions with lower probabilities can be eliminated through the non-maximum suppression operation.
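The non-maximum suppression operation mentioned above can be sketched as a greedy procedure over scored boxes. The box format (x1, y1, x2, y2), the function names and the 0.5 overlap threshold are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, highest face probability first.

    A candidate is dropped if it overlaps an already kept box by more
    than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

The same routine would be applied to the candidate regions passed from the P-net to the R-net and from the R-net to the O-net.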

Fig. 9. Structure of the maximum feature map pooling module

The proposed knowledge status tracking model is based on the memory augmented neural network. It can be seen from Figure 10 that the model consists of an external memory matrix module, a weighting module and a recurrent neural network module. The input of the external memory matrix module is the data sequence of a student's historical knowledge point learning status $LS_t = [(KP_1, M_1), (KP_2, M_2), \ldots, (KP_n, M_n)]$, where the sequences $KP_t = [KP_1, KP_2, \ldots, KP_n]$ and $M_t = [M_1, M_2, \ldots, M_n]$ respectively represent the tag of each knowledge point at time $t$ and the probability of correct answers to exercises, which characterizes how well the student has mastered each knowledge point.

Fig. 10. Structure of the knowledge learning status tracking module

Fig. 11. Changes in the loss curve of the face detection module

Fig. 12. Changes in the loss curve of the face recognition module

Table 1. Test results of face recognition