Intelligent System Using Deep Learning for Answering Learner Questions in a MOOC

hamal.oussama@gmail.com
Abstract— Despite the great success of Massive Open Online Courses (MOOCs), the completion rate of learners remains very low. Therefore, a lot of research has been done to understand and solve this dropout problem. This work is part of the same effort to improve learning in MOOCs. We propose an intelligent system capable of assisting learners by answering their questions on the subjects covered in the MOOC. The architecture of our system is based on recent advances in artificial intelligence, in particular the applications of deep learning in the field of natural language processing (NLP). The results obtained are quite interesting and demonstrate the relevance of our


Introduction
Several studies have addressed the major challenge of learner dropout in MOOCs. The literature covers different aspects of this problem, ranging from estimating dropout rates and identifying their causes to proposing remedial mechanisms.
The problem of very high dropout rates in MOOCs has been observed since their early development and persists to this day. Indeed, [1] reports studies on several MOOCs carried out at the start of the rapid growth of this mode of online training. Completion rates were very low, around 5% for one of the first courses on the edX platform, launched in 2012. Another study examined six University of Edinburgh courses launched on the Coursera platform in 2013. These courses covered a variety of fields, including social sciences, engineering sciences and medicine. Out of a total of 309,000 registered learners, only around 12% completed their training. Further statistics covering 1,000 courses on the Xuetang platform indicate a completion rate of 4.5% [2].
Many causes have also been reported. On the learner's side, these include the lack of prior intention to complete the course, lack of time, and insufficient prerequisite knowledge and skills. On the training side, they include the structure of the course, its difficulty and the lack of interactivity [3] [4]. Thus, the reasons for dropping out differ from one learner to another and from one course to another.
Based on the literature, we distinguish two main families of remediation strategies. The first is to identify learners at risk of dropping out and then plan targeted interventions for them. The second is to integrate personalized support into the MOOC to assist each learner according to their specific needs.
Regarding the first approach, many studies have addressed dropout prediction [3] [5] [6]. This step is considered very important for better orienting interventions aimed at reducing dropout. These approaches rely on machine learning classification techniques, made applicable by the data already available on MOOCs. However, these classification algorithms raise two problems. On the one hand, they do not provide information that explains their predictions, whereas targeted remediation, to be effective, must take into account the specific causes that may lead a learner to drop out. On the other hand, there is no measure of the quality and quantity of data needed for effective machine learning, which raises legitimate questions about the reliability of these predictions. Assuming that the predictions are reliable, it would then be possible to intervene with the learners identified. The authors of [7] propose sending a motivational email to learners identified as at risk of dropping out, but they themselves observed that this approach was ineffective.
The second strategy favors the implementation of personalized assistance tools such as intelligent tutors [8] [11] and conversational agents [12] [17].
The main contribution of our work is the design of an intelligent system that answers learner questions. After consulting a resource, the learner may have various questions that can be put directly to the system. With this form of support, learning in a MOOC would be more interactive and engaging. Our system uses the most recent advances in question answering systems based on deep learning techniques [30].

Question answering based on deep learning
From the dawn of artificial intelligence, the ability of a machine to understand and use natural language has been considered a primary criterion of its intelligence [18] [19] [36]. The famous Turing test aims to assess the intelligence of a machine by determining whether it can conduct a dialogue in natural language with a human interlocutor without the latter being able to tell whether they are talking with a machine or a human. As a result, the language component of artificial intelligence has been central since the very beginning of the field. The great success of artificial intelligence applications in natural language processing has made possible the development of question answering systems whose performance exceeds that of humans [25].
In deep learning, a question answering system is represented by a more or less complex architecture of neural networks that takes as input a textual question in natural language together with a text document [23] or an image [21] [22]. The output should match the correct answer if it exists in the document or image provided.

Datasets
To support the development of increasingly effective question answering systems, many datasets have emerged. They differ in several respects, including the areas of knowledge covered, the sources of the questions, the nature of the questions, the type of answers and general characteristics of the datasets such as the number of examples.
SQuAD [24] [25] exists in two versions. The first contains around 100,000 questions on various Wikipedia articles. The second adds 50,000 questions whose answers do not exist in the articles, so that models are also trained to detect such cases. The questions are factual and their answers correspond to portions of text in the articles. Among the best-performing models on this dataset, we note [26] [29].
FQuAD [30] is largely inspired by SQuAD and is based on Wikipedia articles in French. It contains over 25,000 questions in version 1.0 and over 60,000 questions in version 1.1. CamemBERT is a state-of-the-art model on this dataset.
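To make the idea of answers as "portions of text" concrete, the following minimal sketch shows one way an extractive example could be stored. The class and field names are illustrative assumptions and do not reproduce the exact SQuAD file schema; the passage and questions are invented for the illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAExample:
    """One extractive QA example in a simplified, SQuAD-like layout (illustrative names)."""
    context: str                 # the passage the question refers to
    question: str
    answer_text: Optional[str]   # None when the passage holds no answer
    answer_start: Optional[int]  # character offset of the answer in the passage

    @property
    def is_impossible(self) -> bool:
        """True for questions whose answer does not exist in the passage."""
        return self.answer_text is None

    def answer_span(self) -> Optional[str]:
        """Recover the answer from the passage using its character offset."""
        if self.answer_text is None or self.answer_start is None:
            return None
        end = self.answer_start + len(self.answer_text)
        return self.context[self.answer_start:end]


context = "The formal neuron is limited to binary classification problems."
answerable = QAExample(
    context=context,
    question="What kind of problems is the formal neuron limited to?",
    answer_text="binary classification problems",
    answer_start=context.index("binary classification problems"),
)
unanswerable = QAExample(
    context=context,
    question="Who introduced the formal neuron?",
    answer_text=None,
    answer_start=None,
)
```

An answerable example stores where its answer sits in the passage, while an unanswerable one carries no span at all, which is the case the second version of the dataset is meant to cover.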

MOOCQA: A system for question answering in a MOOC environment
Deep learning models for question answering are becoming increasingly mature. We propose an architecture to integrate them efficiently into a MOOC environment. The principle of our system is to find the answers to a learner's question formulated in natural language and relating to the topics covered in the MOOC. Given a question, the system first identifies the most relevant resources and then extracts information that may be the correct answer. To do this, the system includes a knowledge source, a data preprocessor, a resource selector and an answer extractor.

Knowledge source
The educational resources within the MOOC come in several formats, such as videos and text documents, and cover various activities such as courses, quizzes and forums. These resources are the knowledge source for our question answering system. Each resource is represented as text and positioned in the tree structure of the knowledge source, which follows the structure of the MOOC.
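The tree-structured knowledge source described above could be represented as in the sketch below. This is only an illustration under assumptions: the class names, the `kind` labels and the "Week 1" / "Week 2" layout are hypothetical and are not taken from the system itself.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Resource:
    title: str
    kind: str  # hypothetical labels: "video", "document", "quiz", "forum"
    text: str  # textual representation (the transcript, for a video)


@dataclass
class Node:
    """A node of the MOOC structure (course, week, sequence, ...)."""
    name: str
    resources: List[Resource] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)

    def walk(self, path: str = "") -> Iterator[Tuple[str, Resource]]:
        """Yield every resource together with its position in the tree."""
        here = f"{path}/{self.name}"
        for resource in self.resources:
            yield here, resource
        for child in self.children:
            yield from child.walk(here)


# Hypothetical course layout, used only to exercise the structure.
mooc = Node("Deep learning", children=[
    Node("Week 1", resources=[
        Resource("The formal neuron", "video", "Transcript of the video..."),
        Resource("Quiz 1", "quiz", "Text of the quiz..."),
    ]),
    Node("Week 2", resources=[
        Resource("Supervised learning", "video", "Transcript of the video..."),
    ]),
])

paths = [path for path, _ in mooc.walk()]
titles = [resource.title for _, resource in mooc.walk()]
```

Walking the tree gives the flat list of text resources that later components need, while keeping each resource's position in the course available.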

Data preprocessor
We distinguish two types of preprocessing, namely video transcription and numerical text encoding. In the transcription phase, all videos are transcribed to obtain text documents. This phase does not necessarily require a tool, because videos are often accompanied by their transcripts. The encoding phase then converts the textual representation of the question and of the resources into a vector representation. Several techniques are available for this, such as word embedding techniques [31] [34]. The choice depends on the models used in the selection and extraction steps.

Resource selector
The role of this component is to identify the resources most relevant to the learner's question. The selector thus returns an ordered list of K resources using a method such as TF-IDF [32] [37].
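A minimal sketch of such a selector, using only the Python standard library, is given below. It assumes a smoothed IDF weighting and cosine similarity, which is one common TF-IDF variant and not necessarily the one used in the system; the function names and the sample resources are illustrative.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(docs: List[List[str]]) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    """Return one sparse TF-IDF vector per document, plus the IDF table."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant): ln((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_resources(question: str, resources: List[str], k: int = 3) -> List[Tuple[int, float]]:
    """Return the K best (resource index, score) pairs, most relevant first."""
    vectors, idf = tfidf_vectors([tokenize(text) for text in resources])
    # Question terms unseen in the resources carry no weight.
    q_counts = Counter(t for t in tokenize(question) if t in idf)
    q_vec = {t: c * idf[t] for t, c in q_counts.items()}
    scored = [(i, cosine(q_vec, vec)) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


resources = [
    "The formal neuron performs a binarization step controlled by a threshold.",
    "Supervised learning minimizes an average cost function over the learning set.",
    "The gradient of the cost function is computed with respect to the model parameters.",
]
ranking = select_resources("What does supervised learning minimize?", resources, k=2)
```

Here the second resource is ranked first because it is the only one that shares terms with the question. In practice the selector would run over the text of every resource in the knowledge source.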

Answer extractor
For the resources returned by the selector, answer extraction depends on the resource type. If the resource is a text document that presents a notion, the answer is extracted from the document using a deep learning question answering model [26] [29] [35]. If it is a forum or a practice quiz, the extractor returns the existing answers.
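The dispatch on resource type can be sketched as follows. The `Resource` fields, the `kind` labels and the stand-in model are assumptions made for illustration; in a real system `qa_model` would wrap a deep learning question answering model.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A QA model is treated here as any callable (question, passage) -> extracted span.
QAModel = Callable[[str, str], str]


@dataclass
class Resource:
    kind: str                                         # "document", "forum" or "quiz" (assumed labels)
    text: str = ""                                    # passage text, for documents
    answers: List[str] = field(default_factory=list)  # answers already present, for forums and quizzes


def extract_answers(question: str, resource: Resource, qa_model: QAModel) -> List[str]:
    """Return candidate answers for one resource selected for the question."""
    if resource.kind == "document":
        # Text documents go through the question answering model.
        return [qa_model(question, resource.text)]
    if resource.kind in ("forum", "quiz"):
        # Forums and practice quizzes already contain answers: return them as they are.
        return list(resource.answers)
    raise ValueError(f"unsupported resource kind: {resource.kind}")


def first_sentence_model(question: str, passage: str) -> str:
    """Stand-in for a real model: returns the first sentence of the passage."""
    return passage.split(".")[0].strip()


doc = Resource(kind="document", text="The formal neuron is a binary classifier. It uses a threshold.")
forum = Resource(kind="forum", answers=["Existing answer posted in the forum."])

from_doc = extract_answers("What is the formal neuron?", doc, first_sentence_model)
from_forum = extract_answers("What is the formal neuron?", forum, first_sentence_model)
```

Keeping the model behind a plain callable means the extractor does not depend on a particular question answering model.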

Experiments and results
To assess the effectiveness of our solution, we performed an experiment testing the essential component of the proposed architecture, namely the answer extractor. The experiment consists of applying a question answering model to educational content from real MOOCs.
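As a rough illustration of this kind of experiment, the sketch below applies an extractive question answering model to a passage through the `pipeline` API of the Hugging Face Transformers library. The checkpoint name is an assumption (the text does not name the exact checkpoint), and the code assumes an installed Transformers version that provides the "question-answering" pipeline task.

```python
from typing import Any, Callable, Dict, Tuple

# Assumed checkpoint name: a French CamemBERT question answering model on the Hugging Face hub.
MODEL_NAME = "etalab-ia/camembert-base-squadFR-fquad-piaf"

QAPipeline = Callable[..., Dict[str, Any]]


def load_french_qa(model_name: str = MODEL_NAME) -> QAPipeline:
    """Build a question-answering pipeline (downloads the model on first use)."""
    # Imported lazily so the rest of the module works without the library installed.
    from transformers import pipeline
    return pipeline("question-answering", model=model_name, tokenizer=model_name)


def answer_question(qa: QAPipeline, question: str, passage: str) -> Tuple[str, float]:
    """Return the extracted text span and the confidence score reported by the model."""
    result = qa(question=question, context=passage)
    return result["answer"], float(result["score"])


# Example call, left commented out because it downloads the model:
# qa = load_french_qa()
# passage = ("L'objectif général de l'apprentissage supervisé sera de minimiser "
#            "une fonction de coût moyenne sur l'ensemble d'apprentissage.")
# answer, score = answer_question(qa, "Que cherche-t-on à minimiser ?", passage)
```

The score returned with each answer could be used to decide whether an answer is shown to the learner; that use is a design suggestion and is not described in the experiment.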
To do this, we used a question answering model for the French language, available through the Transformers library from Hugging Face. The model is based on the CamemBERT language model and trained on three datasets: PIAF, FQuAD and SQuAD-FR. It achieved good F1-score performance [33]. We then took passages from the course transcripts of the MOOC entitled "Deep learning", offered on the FUN platform. For each passage, we formulated questions, some of which have their answers in the passage concerned and others not. Table 2 presents the passages, questions and answers extracted by the model.

Table 2: passages, questions and answers extracted by the model (English translation, followed by the original French).

Passage: We can see the formal neuron as performing a binarization step where the neuron is activated if the value of the combination of weights is greater than this threshold b.
"On peut voir le neurone formel comme effectuant une étape de binarisation où le neurone est activé si la valeur de la combinaison des poids est supérieure à ce seuil b."
Question: Does the threshold vary from neuron to neuron, or is it the same for all neurons in a network?
"Le seuil varie d'un neurone à l'autre ou au contraire est-il le même pour tous les neurones dans un réseau ?"

Passage: First of all, the formal neuron is limited to binary classification problems, that is to say, that we will predict 2 output classes, the class plus 1 or the class minus 1, thus adapting to this kind of context.
"Tout d'abord, le neurone formel est limité à des problèmes de classification binaire, c'est-à-dire, qu'on va prédire 2 classes en sortie, la classe plus 1 ou la classe moins 1, donc s'adapter à ce genre de contexte."
Question: What is a binary classification problem?
"Qu'est ce qu'un problème de classification binaire ?"
Answer: [that we will predict 2 output classes] [qu'on va prédire 2 classes en sortie,]

Passage: The general objective of supervised learning will be to minimize an average cost function over the learning set.
"L'objectif général de l'apprentissage supervisé sera de minimiser une fonction de coût moyenne sur l'ensemble d'apprentissage."
Question: In supervised learning, what does the minimization of an average cost function over the learning set represent?
"En apprentissage supervisé, que représente la minimisation d'une fonction de coût moyenne sur l'ensemble d'apprentissage ?"
Answer: [The general objective] [L'objectif général]

Passage: In this context, we will be able to calculate the gradient of the cost function with respect to the parameters of the model w, which make it possible to predict the output with respect to the input.
"Dans ce cadre-là, on sera capables de calculer le gradient de la fonction de coût par rapport aux paramètres du modèle w, qui permettent de prédire la sortie par rapport à l'entrée."
Question: What is the use of the gradient?

Discussion
The experiment covers only a single component of the proposed solution, so the results do not allow a complete evaluation of our architecture at this stage. However, since the answer extractor is the central element, its performance is a determining criterion for the effectiveness of MOOCQA.
Out of six randomly selected passages, the model provided an answer for each. In four cases the answer existed in the passage, and in the other two it did not. Although this is a fairly small test sample, these results remain interesting. Indeed, as presented in Table 1, the model has already been evaluated on fairly large volumes of data and has obtained good performance. The specificity of our experiment therefore lies in the context of application, that is, the use of the model on data with real educational value, in particular the passages and questions proposed in the experiment.
The results indicate that it would be worthwhile to integrate this type of model into MOOC environments, in particular following the architecture of MOOCQA, because it makes it possible to exploit all of the resources of the MOOC by extracting the most relevant information for a specific question.
However, one limitation of our system is that it may fail to provide an answer to the learner's question. This happens when the system does not find a relevant answer in the knowledge source. In addition, the system's response will always be a textual portion of an existing resource in the knowledge source. As it stands, the system has no mechanism for synthesizing several resources and generating responses. Therefore, questions requiring understanding, analysis or synthesis will not be answered, even though these types of questions are generally essential in a learning process.

Conclusion and perspectives
In this work we have proposed a question answering system that can exploit all the resources within a MOOC environment. Our system has the advantage of integrating easily into any MOOC or online learning platform. Its architecture is very flexible and could include any deep learning question answering model. The results of our experiment indicate that the CamemBERT-based question answering model for the French language would be very effective in a MOOC environment.
Furthermore, in order to achieve a more comprehensive assessment, the next step in this work would be to integrate MOOCQA into various online course environments covering several scientific and literary fields. This would allow us to collect a variety of data for a deeper analysis of the real impact of our system, and to estimate the influence of MOOCQA on learner performance, motivation and dropout rate.