Detecting Students Gifted in Mathematics with Stream Mining and Concept Drift Based M-Learning Models Integrating Educational Computer Games

Abstract. One of the problems of individualized classes, which adapt the contents and methods of teaching to students of different cognitive capabilities, is the early and widely available detection of students gifted in certain educational fields. The paper proposes models which are based on stream mining and which can detect students gifted in Mathematics solely on the basis of their interaction with an m-learning system that uses educational computer games, with no access to any other feature except for student age. The classification accuracy and time efficiency of different feature selection methods are examined in order to make the models less complex and hence more interpretable. Stream mining classification accuracy in the utilized models is evaluated on new (as yet unseen) records, while concept drift detection determines the points in time at which new models should be built.


Introduction
Educational data mining (EDM) is an interdisciplinary field which is concerned with the development of methods for conducting research on educational data, which can also be applied in analysing data from different e-learning systems. Besides applying traditional techniques for data mining, the approach requires certain adaptations specific to education-related problems, such as improving the educational process or predicting student success [1].
Transforming and adapting e-learning systems to m-learning makes them widely available, so that learning is possible anytime and anywhere [2]. In the last four years, research in the field of m-learning has grown by almost 67.6% [3].
In order to improve the educational process by adapting to specific needs of students gifted in certain fields, it is necessary to detect these students as early as possible. One of the ways to accomplish this is to use decision trees for classifying students and content [4]. This paper focuses on detecting primary school students gifted in Mathematics.
In order to facilitate the development of mathematical creativity with students in the educational process, one has to make sure that the choice of learning problem tasks coincides with their interests, understanding, skills, and that creativity is credited for as soon as it is spotted [5].
All the components of big data knowledge discovery are utilized in this research. However, since classical data mining models are not appropriate for data streams generated by the interaction with educational computer games in m-learning systems, stream mining algorithms are employed in the research [6].

Related Work
Feature selection enables reducing complexity of data mining models by decreasing the number of dimensions, facilitating generalization, speeding up the process of learning, and facilitating model interpretability [7].
The majority of machine learning algorithms already entail feature selection. However, it has been shown in practice that adding just one additional feature with random binary values decreases the classification accuracy of decision trees (C4.5) by 5-10%. This is due to the inner workings of splitting in decision trees. With each new tree depth level, less data is available for choosing a splitting feature, which is why even a newly added random feature might seem important to the algorithm. This issue is even more pronounced when the dataset contains a great number of irrelevant features [8].
Besides the methods already integrated in classification algorithms, two additional approaches to feature selection can be used in order to obtain a reduced dataset [9]:
1. Independent assessment based on general data characteristics (filter method).
2. Application of the machine learning algorithm which will eventually be used for training (envelope method, more commonly known as the wrapper method).
The first approach takes place prior to learning the prediction model and filters the feature set in order to obtain the reduced dataset which is as good as possible before training, while the second approach filters features while learning a predictive model by selecting a subset of features which gives the best accuracy.
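The filter approach can be illustrated with a minimal sketch that ranks nominal features by information gain, one of the filter criteria mentioned in this paper. The function names, feature names, and toy records below are hypothetical and are not taken from the paper's dataset or from WEKA; numeric features would first have to be discretized.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Class entropy minus the expected class entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def rank_features(rows, labels):
    """Return (feature_name, score) pairs sorted by descending information gain."""
    names = list(rows[0].keys())
    scores = {n: information_gain([r[n] for r in rows], labels) for n in names}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy, made-up records (assumption: NOT the paper's data).
rows = [
    {"session_short": "yes", "platform": "mobile"},
    {"session_short": "yes", "platform": "desktop"},
    {"session_short": "no", "platform": "mobile"},
    {"session_short": "no", "platform": "desktop"},
]
labels = ["additional", "additional", "regular", "regular"]
ranking = rank_features(rows, labels)
```

In this toy case the hypothetical feature `session_short` separates the classes perfectly and is ranked first, while `platform` carries no information about the class. A filter of this kind is computed once, before any model is trained, which is what distinguishes it from the second approach.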
Static datasets are finalized datasets which can be read several times either thoroughly or in different ratios, dependent on the model evaluation strategy, i.e. the partitioning of data into training and test sets. Unlike finalized datasets, data streams are being continuously updated and modified and may never end, which is why the following requirements and limitations are imposed [10]:
1. Every stream is observed only once.
2. Processing time of a record must be short.
3. Memory usage must be small.
4. Results have to be available at any point of time.
5. Streams evolve over time and hence represent non-stationary data sources.
Additionally, data in streams are analysed sequentially, in order of arrival. Random access is not available, and a record cannot be reused once it has been processed. In order to reduce the processing time, it is necessary to define an upper limit on the number of operations needed for processing a record. For real-time data analysis, algorithms have to process data at the same speed at which the data arrive, or faster. The memory used by the algorithms holds the stored statistical data and the stored model, the two of which are often combined so that the statistical data is an elementary part of the model while making predictions. A model has to be available irrespective of the number of data records processed, and a prediction needs to be made even when only a small number of records has been processed [11].
In order to develop decision-making trends related to teaching and its advancement based on data mining, a data-driven culture needs to be developed among teachers and institutions. One of the currently active areas of research is game learning analytics (applying data mining and visualization techniques to player interactions) [12].
Outcomes are influenced by affective states, i.e. positively by concentration (engaged), and negatively by confusion and boredom, as evident from the analysis of learning outcomes in Online Mathematical Problem Solving. With the aim to reduce students' boredom, Chiu in [13] suggests including interesting problem-solving designs, such as games, into online mathematical problem-solving platforms.
Decision trees are one of the most often used algorithms in the data analysis of computer games (intended for fun) and in building predictive models of players, with the goal to enhance player experience [14]. Besides some other algorithms, they are also used with computer games which are primarily intended for learning. Findings of such research studies are important to teachers, students, parents, as well as game developers, and researchers [15].
Data stream processing supplemented with feature selection and analysis can be used in secondary schools for predicting STEM careers from student usage of an Intelligent Tutoring System [16].
In this paper, a learning model which uses computer games is implemented, and data stream mining is utilized for a preliminary selection of primary school students gifted in Mathematics. The feature selection and analysis in the computer game is important for building predictive models in order to reduce the time needed for building the model and to increase the interpretability of models. The proposed learning model enables monitoring motivation for using the system and learning outcomes achieved. It provides equal user experience in e-learning and in m-learning. The learning model integrates a social network adapted for primary school use.

Research Questions
For the purpose of this research, students gifted in Mathematics are defined as those primary school students who take additional school classes in Mathematics.
The research analyses whether it is possible to use data stream mining for detecting students gifted in Mathematics with a preliminary evaluation based on assessing motivation and acquired knowledge.
The following hypothesis is formulated: it is possible to apply data stream mining techniques for a preliminary student evaluation based on assessing motivation and acquired knowledge in a mobile-friendly model which uses educational computer games for learning Mathematics.
The hypothesis will be accepted if stream mining enables building models which detect students gifted in Mathematics based on their interaction with the mobile learning system which uses educational computer games and implements features which refer to motivation and acquired knowledge.
Since the research additionally questions how to reduce the complexity of the obtained models, effects of feature selection methods on processing time, dimensions, and model accuracy are also explored. The question of how distributional changes can be observed within the recorded data is answered by the concept drift detection.

Methodology
Since the research presented in this paper deals with the classification of students based on additional school classes in Mathematics, classification decision trees are used to ease the interpretation and visualization of the model for the teachers of Mathematics. These machine learning algorithms can be applied with data streams, and besides being easily interpretable, they are flexible, which means that it is possible to obtain different versions of a model with the aim to achieve higher accuracy. The disadvantage of using classification decision trees refers to situations in which slight changes in feature values can affect accuracy and the model [17]. This research, therefore, utilizes feature selection methods in order to reduce the number of features.

Feature selection methods
For the selection of relevant features, besides the envelope method which uses classification evaluators, this research also uses Hoeffding trees for data streams [18], and filter methods implemented in WEKA [19] which work with nominal class values: Correlation, Information Gain, Chi-squared, and Gain Ratio. Filter methods assign a weight to each feature depending on its relevance for the class [20].

Stream mining algorithms
Classic decision trees process data by keeping all the data in memory, which is why it is not possible to apply them efficiently to data streams.
Classification decision trees which are used in this research and which process data by satisfying all the data stream requirements are Very Fast Decision Trees (VFDT), or the so-called Hoeffding trees. The Hoeffding tree exploits the assumption that a small sample is good enough to choose an optimal splitting feature. The Hoeffding bound is used for selecting the best splitting feature out of n samples [21].
With data streams, first records are used for selecting the root of the tree. After selecting the root feature, subsequent records are used for branching to leaves. The Hoeffding bound is used for determining the smallest number of examples needed at a node for selecting a splitting feature.
For a random variable r with range R and n independent observations, the mean value $\bar{r}$ is calculated, and the Hoeffding bound then states that, with a probability of $1 - \delta$, the true mean of the variable is at least $\bar{r} - \epsilon$, where $\epsilon = \sqrt{\frac{R^{2} \ln(1/\delta)}{2n}}$.
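As a worked illustration of the bound, the following sketch computes ϵ for a given range, confidence parameter, and number of observations, and inverts the formula to obtain the smallest number of examples for a target ϵ. The concrete values (R = 1, as for information gain with two classes, and δ = 10⁻⁷) and the split rule comparing the two best features are assumptions made for the example; they are not settings reported in this paper.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean is at
    least (sample mean - epsilon) after n independent observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def min_examples(value_range, delta, epsilon):
    """Smallest n for which the Hoeffding bound is no larger than epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta)
                     / (2.0 * epsilon ** 2))

def should_split(best_gain, second_gain, value_range, delta, n):
    """Assumed split rule: split once the observed advantage of the best
    feature over the runner-up exceeds epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)

# Assumed example values: R = 1, delta = 1e-7, n = 200 observations at a leaf.
eps = hoeffding_bound(1.0, 1e-7, 200)      # roughly 0.2
needed = min_examples(1.0, 1e-7, 0.1)      # examples needed for epsilon <= 0.1
```

Because ϵ shrinks with the square root of n, halving ϵ requires about four times as many examples at the node.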
Since the Hoeffding tree processes data streams, the generated trees can grow without limit, which conflicts with the memory usage requirement. The VFDT algorithm therefore limits the number of nodes. The algorithm estimates the error reduction achievable by each active leaf. When the model reaches the memory limit, the algorithm deactivates the less promising leaves. VFDT keeps observing inactive leaves and can re-activate them when they become more promising than those currently active. In addition, the algorithm eliminates features which do not look promising for splitting, and it can delete from memory the statistics (class distributions for each value) related to these features.
This research also uses two extensions of the Hoeffding tree: the Hoeffding option tree and the Hoeffding adaptive tree.
The Hoeffding option tree is built out of a unique structure of multiple Hoeffding trees because it contains additional nodes which enable optional splitting and can thus change the tree building path [22].
The Hoeffding adaptive tree uses the ADWIN (Adaptive Windowing) algorithm based on the sliding window model which decides on the active parts and the ones to be discarded by analysing the window size. When their accuracy decreases, old branches are replaced with the new ones [23].
The Hoeffding tree-based models built within this research are evaluated by calculating accuracy in such a way that every model is always tested on as yet unseen records: each record is first used for testing and only then for training.
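This test-then-train scheme is often called prequential evaluation. A minimal sketch is given below; the majority-class learner is only a hypothetical stand-in for a Hoeffding tree, and the four records are made up for the example.

```python
from collections import Counter

class MajorityClassLearner:
    """Stand-in incremental learner: predicts the most frequent class so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, features):
        if not self.counts:
            return None  # nothing learned yet
        return self.counts.most_common(1)[0][0]

    def learn(self, features, label):
        self.counts[label] += 1

def prequential_accuracy(stream, learner):
    """Use each record first for testing and only then for training."""
    correct = 0
    seen = 0
    for features, label in stream:
        if learner.predict(features) == label:  # test on the unseen record
            correct += 1
        learner.learn(features, label)          # then train on it
        seen += 1
    return correct / seen if seen else 0.0

# Made-up records: (features, class), where the class says whether the
# student takes additional classes in Mathematics.
stream = [
    ({"session": 3}, "no"),
    ({"session": 5}, "no"),
    ({"session": 2}, "yes"),
    ({"session": 8}, "no"),
]
accuracy = prequential_accuracy(stream, MajorityClassLearner())
```

Every record contributes to the accuracy estimate exactly once and before the model has seen its label, so no separate test set has to be held back from the stream.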

Concept drift
Unlike stationary data, data streams can contain different types of concept drift, depending on whether the cause of a change or the speed of a change is considered. A concept drift refers to the situations in supervised learning in which the relationship between input data and target outcome changes over time. A real concept drift refers to the cases in which the function separating different classes changes (the function determines the probability of class affiliation), while a virtual concept drift occurs when there is a change in the distribution of data. With respect to time, a concept drift can occur suddenly, incrementally, gradually, or repeatedly, while anomalies do not have to lead to a concept drift if they occur only once or by mere chance [24].
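The distinction between the two kinds of drift can be written compactly in probabilistic terms. The notation is an assumption introduced here for illustration: X denotes the input features, y the class, and p_t the distribution at time t.

```latex
\begin{align*}
\text{real concept drift:}    \quad & p_{t}(y \mid X) \neq p_{t+1}(y \mid X) \\
\text{virtual concept drift:} \quad & p_{t}(X) \neq p_{t+1}(X)
  \quad \text{while} \quad p_{t}(y \mid X) = p_{t+1}(y \mid X)
\end{align*}
```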
In this research, the Drift Detection Method (DDM) is chosen for the analysis of concept drifts by comparing accuracy over different data streams, since it is considered best suited for datasets which exhibit sudden concept drifts [25].
The Drift Detection Method is independent of the learning algorithm used. It analyses whether the classifier makes a correct or an incorrect class prediction, modelling the errors with a binomial distribution, where the probability v_i of an incorrect classification at the i-th position of the sequence is estimated together with its standard deviation s_i. A significant increase in classification errors signifies a change in the distribution of classes, so the current model is no longer appropriate. Warning and detection levels are monitored, and in case of a concept drift, the learned model is reset to its initial state, as are the values v_min and s_min. A new model is built based on the records stored after the last warning [26].
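The mechanism can be sketched as follows. The sketch assumes the thresholds commonly used with DDM (a warning at v_min + 2·s_min, a drift at v_min + 3·s_min, and a minimum of 30 records before testing) and the binomial standard deviation s_i = sqrt(v_i(1 − v_i)/i); none of these settings is reported in this paper. The class name and the synthetic error sequence are hypothetical, and the storing of records after a warning is omitted.

```python
import math

class DriftDetector:
    """Simplified DDM-style detector fed with 1 (error) or 0 (correct)."""

    def __init__(self, min_instances=30, warning_level=2.0, drift_level=3.0):
        self.min_instances = min_instances
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        """Forget all statistics, including v_min and s_min."""
        self.n = 0
        self.errors = 0
        self.v_min = math.inf
        self.s_min = math.inf

    def update(self, error):
        """Process one prediction outcome; return 'stable', 'warning' or 'drift'."""
        self.n += 1
        self.errors += error
        v = self.errors / self.n                # error probability v_i
        s = math.sqrt(v * (1.0 - v) / self.n)   # standard deviation s_i
        if self.n < self.min_instances:
            return "stable"
        if v + s < self.v_min + self.s_min:     # new best operating point
            self.v_min, self.s_min = v, s
        if v + s > self.v_min + self.drift_level * self.s_min:
            self.reset()
            return "drift"
        if v + s > self.v_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"

# Synthetic outcomes: 10% errors for 100 records, then only errors.
outcomes = [1 if i % 10 == 0 else 0 for i in range(1, 101)] + [1] * 100
detector = DriftDetector()
signals = [detector.update(e) for e in outcomes]
first_drift = signals.index("drift")
```

On this synthetic sequence the detector stays quiet during the stable phase, raises warnings shortly after the error rate jumps, and then signals a drift, at which point a learner wrapped around it would be rebuilt.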
Sudden concept drifts (change in the classification accuracy) in this research occur in cases in which, instead of incremental classification over the whole dataset, window-based classification is made. Dependent on the window size, different ranges of students are analysed. Therefore, ratios between classes and distributions of records of different students significantly vary, especially in cases of smaller window sizes.

Dataset
The dataset is collected from the student interaction with the game "Zagonetke mudrog lisca" (Riddles of the wise fox), which deals with mathematical problem tasks presented through the Roman numerals on 15 pre-defined levels ordered by increasing difficulty. Throughout the interaction with the game, the data related to the following features are stored: current level, element holding time (time taken to think), element played, hit, number of moves at the level, number of remaining moves, number of times played, scores gained by the move, session (the number of games played in succession), total number of moves, total score prior to the move. Besides the listed features, the system also stores timestamps of records and platforms used for accessing the system (desktop/mobile).
The student data are anonymized by nicknames, based on the model Anonymized Social Network-based Mobile Game System for Learning Mathematics [27]. Besides nicknames, the student data also include age and the information on whether students attend additional school classes in Mathematics. Additional school classes in Mathematics are used as preparation for competitions and cover more advanced content than the regular classes with respect to age. Additional school classes in Mathematics are attended only by students gifted in Mathematics.
The sample on which the research is conducted consists of 104 students aged 11-14 years and enrolled from the fifth to the eighth grade of a primary school in the Republic of Croatia. Besides taking regular school classes in Mathematics, a total of 23 of the 104 students also attend additional school classes in Mathematics.
The time period includes one week prior to winter holidays, three weeks of holidays, and two weeks after holidays in the academic year 2017/2018.
The system was accessed by 73 students (70.19% of students). The data is analysed with MOA, which is a tool for data stream mining with the concept drift detection [28].

Data Analysis and Results

Anomaly in the Hoeffding tree accuracy at the first level of the game
During the course of the game, the students made 33081 moves. Due to an unexpected deviation observed in the model accuracy, a total of 29212 moves is used for the analysis. More precisely, a significant deviation is detected in accuracy related to the records which refer to the first level. In the model of students gifted in Mathematics based on the Hoeffding tree, the problem is most pronounced for the first 5000 records with the window size set to 500, a grace period of 200 records (the number of instances a leaf should observe between splits), and a sample frequency of 1000.
The whole dataset exhibits an abrupt decrease in accuracy on the initial records (which contain first-level data) compared to the dataset without the first-level records (Fig. 1). The cause, as observed, is that some students skip the first level by purposefully making mistakes (the game ends after three wrong moves) in order to access the menu at the end of the game, which enables them to see the top list (Top 20), compare their accomplishments with other students, and check the number of medals won and whether anyone has beaten them. Additionally, the first level is used by the teachers for providing instructions and explaining different ways of solving tasks. The dataset is thus pre-processed to discard records related to the first-level moves.
Fig. 1. Accuracy of the Hoeffding tree on the whole dataset and the dataset with the first-level moves discarded in the game "Zagonetke mudrog lisca"
Table 1 shows feature selection methods and the respective rank values of features related to student activities in the game "Zagonetke mudrog lisca" with respect to the class of taking additional school classes in Mathematics. Relative ratios of all features which have values greater than zero according to the classification attribute evaluator, i.e. the five features which have the greatest influence on the resulting classification model, are shown in Fig. 2.

Applying feature selection methods
In case of a correct answer, the scores gained by the move correlate with the element holding time (the amount of scores gained by answering correctly equals the time left after answering). That feature is therefore discarded in the reduced dataset of the game "Zagonetke mudrog lisca". Considering the fact that areas of Mathematics taught in different grades differ, student age is also used as a feature. The reduced set of features, which is compiled out of the ranked feature set, and which contains only 43% of features compared to the original feature set, contains the following features: number of times played, total score prior to the move, element holding time, session, age, and additional Mathematics (class).

Effects of feature selection on model accuracy
Classification models for the detection of students gifted in Mathematics (students who take additional school classes in Mathematics) are built with stream mining algorithms. Average accuracies of the algorithms over the original and reduced feature sets are given in Table 2. Since the applied algorithms achieve similar results on the original and the reduced feature set (differences range from -0.67% to 1.16%), the complete model is built with the dataset of smaller dimensionality, i.e. the reduced dataset with the higher coefficient as the splitting confidence. Additionally, experiments show that training and model evaluation take about half as long on the reduced dataset, i.e. it is more time efficient. The training times range from 0.22:0.45 for the reduced feature set up to 0.34:0.69 seconds for the original feature set, depending on the algorithm. The highest accuracy model, which is also the simplest regarding the depth of the tree, is the adaptive one. That model represents students who take additional school classes in Mathematics as those who play the game more frequently with fewer games played in succession (session < 7) compared to those who do not take additional school classes in Mathematics.
By considering only features from the reduced dataset which refer to the learning speed, i.e. the time taken to think depending on the number of attempts, the resulting model describes students who take additional school classes in Mathematics as those who need fewer attempts and spend less time thinking. The time taken to think by the students who take additional school classes in Mathematics is over 2.8 seconds only for those students who have played fewer times (number of times played <= 13), while students who do not take additional school classes in Mathematics are slower in solving tasks even afterwards. Taking into consideration that the number of played games refers to motivation, and the time taken to think to the level of acquired knowledge, the obtained model enables preliminary evaluation (additional Mathematics class YES and NO) based on assessing motivation and acquired knowledge.
Therefore, the following hypothesis is accepted: stream mining methods can be used for preliminary student evaluation based on assessing motivation and acquired knowledge in the model which is mobile-friendly and which uses educational computer games for learning Mathematics.

Concept drift in relation to the size of data window
Algorithm accuracy can also be analysed from the aspect of window sizes, which are determined in the following ways in the research:
• 20 (maximum number of possible moves in one session, which therefore includes data related to at least one student)
• 100 (obtained from the average number of sessions per student multiplied by the average number of moves made)
• 1200 (for the maximum number of student sessions multiplied by the maximum number of moves made)
• 5200 (for the maximum number of games played multiplied by the maximum number of moves made).
The learning curve in stream mining does not increase with the increase in the number of records, but oscillates and depends on the window size. Windows greater in size balance the curve (oscillations in accuracy are reduced).
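The smoothing effect of larger windows can be illustrated with a small sketch that computes accuracy per window from a sequence of prediction outcomes (1 for correct, 0 for incorrect). The outcome sequence is synthetic, and the use of consecutive, non-overlapping windows is a simplifying assumption; the evaluator used in the research may slide the window differently.

```python
import statistics

def windowed_accuracy(outcomes, window_size):
    """Accuracy within consecutive, non-overlapping windows of outcomes."""
    accuracies = []
    for start in range(0, len(outcomes), window_size):
        window = outcomes[start:start + window_size]
        accuracies.append(sum(window) / len(window))
    return accuracies

# Synthetic outcomes: blocks of 20 records alternating between
# 90% and 60% correct predictions, repeated five times (200 records).
good_block = [1] * 18 + [0] * 2
weak_block = [1] * 12 + [0] * 8
outcomes = (good_block + weak_block) * 5

small = windowed_accuracy(outcomes, 20)   # one value per 20 records
large = windowed_accuracy(outcomes, 40)   # one value per 40 records
small_spread = statistics.pstdev(small)
large_spread = statistics.pstdev(large)
```

With a window of 20 the accuracy alternates between 0.9 and 0.6, whereas a window of 40 averages each pair of blocks to 0.75, so the curve becomes flat. The overall accuracy is the same in both cases; only the visible oscillation changes.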
The Hoeffding adaptive tree exhibits the smallest standard deviations compared to the regular and option trees, due to discarding branches which cause a decrease in accuracy and replacing them with new branches with every window shift.
Unlike the incremental models, in which an increase in the number of records is followed by an increase in complexity, the models based on the regular and option Hoeffding trees can also be built by including concept drift detection. Concept drift detection has similarities with the Hoeffding adaptive tree. While concept drift detection discards feature values which cause a decrease in accuracy on a certain number of records, the adaptive algorithm continuously adapts the model to new values by making evaluations on each new record in order to achieve higher accuracy, which eventually leads to a complete model. Unlike with the adaptive tree, applying concept drift detection to the regular Hoeffding tree, which is not adaptive, leads to discarding the model whenever there is an accuracy shift and building a new model from the current drift up until the next concept drift. Fig. 3 gives a cumulative display of the depth of the regular Hoeffding tree before and after each concept drift for all the window sizes considered. The red dashed lines label the records at which concept drifts are detected for each window size. The complete (incremental) model has a tree depth of 10, while the model with the same algorithm settings which includes concept drifts gives a detailed tree (depth_max = 9) only with the largest window size. The smallest window size (20) does not result in greater changes due to frequent concept drifts (depth_max = 5).
On the given dataset, the obtained results show that larger window sizes enable training the model which includes concept drifts and which, in terms of complexity and accuracy (93.26%), approaches the complete model (93.98%).

Conclusion and Future Work
Models based on stream mining can be simplified by utilizing feature selection methods in order to retain only the data relevant for the analysis. The sizes and training times of the models obtained within this research are roughly halved with no meaningful loss of accuracy. The dimensionality reduction, due to the smaller number of participating features, reduces model complexity and thereby improves interpretability.
This research shows that stream mining can be used for the detection of students gifted in Mathematics in mobile systems which are based on educational computer games.
Since data streams used for training the model often include concept drifts, up-to-date models based solely on valid data can be obtained with concept drift detection and the inspection of different window sizes. Larger windows enable training models which include concept drifts and, at the same time, approach complete models in terms of complexity and average accuracy.
Predictive classification models trained within this research can be used in preselecting a large number of students who are gifted in Mathematics, without knowing any other feature except for their age, i.e. students can be classified based solely on their interaction with the m-learning system which uses educational computer games.
In the presented implementation of the learning system, stream mining algorithms based on Hoeffding trees detect students gifted in Mathematics with 93.26% accuracy in case of the model with the concept drift detection, and 93.98% accuracy in case of the complete model. By considering and analysing features which reflect motivation and the level of knowledge acquired, i.e. number of times played and time taken to think, respectively, a model which enables preliminary student evaluation can be built.
The implemented m-learning system also supports other types of games, e.g. those with levels based on random variable values. Their analysis is left for future work.