Use of Data Mining Technologies in an English Online Test Results Management System

Abstract: The systematic management of English online test results can ensure the fairness of the test, guarantee the accuracy and safety of the test results, and reduce the consumption of manpower and materials. Unfortunately, existing data mining and management strategies for learner scores cannot track the learning process or score changes of learners. This paper innovatively applies trajectory data mining technology to the design of an English online process test results management system. After analyzing the functional requirements of the system, four basic information tables were constructed in SQL Server 2005. Then, an improved k-means clustering algorithm and a trajectory frequent pattern mining algorithm were combined to cluster the test results and analyze the learning trajectory deviation of the learners. Next, the four system functions were detailed, including login, entry of test results, trajectory setting, and deviation analysis. The effectiveness of the algorithm and the performance of the system were verified through experiments.


Introduction
Online education can break the time and space limitations of teaching and learning; therefore, it has developed vigorously in recent years. Mature online education platforms generally offer a wealth of learning resources and support simultaneous access by multiple users; they are fast and convenient, and they have greatly improved the learning efficiency of learners. Since online education has diverse forms and flexible teaching methods, the methods for examining the learning results of learners are relatively complicated [1][2][3][4]. Especially for English online tests, the massive volume of questions, the variety of question types, the heavy workload of test score statistics, and the low efficiency of score queries all demand prompt solutions [5][6][7][8]. Building English online test results management systems can ensure the fairness of the test, guarantee the accuracy and safety of the test results, and reduce the consumption of manpower and materials [9][10][11][12].
Results management systems are now widely used in schools at all levels at home and abroad, and scholars in the field have conducted various studies on them [13][14][15][16][17][18][19][20]. For example, based on an analysis of the management requirements of learners in colleges and universities, Thekdi and Aven [21] designed an overall functional framework of a results management system for college and university learners, and explored the potential development law of learner performance from the perspectives of teaching, management, and decision-making in higher education. Based on JSP and MySQL, Kreiling and Bounfour [22] developed the web server and database system of a learner test results management system and tested its functions and performance. Moritz et al. [23] adopted the more convenient B/S (browser/server) mode; with ASP.NET as the network programming framework, they built a learner test score management system with functions such as learner login management, course management, and test score query, realizing the tracking of the learning cycle and test score changes of learners. To realize no-refresh interaction between the learner service-end and the system server data, Stadnicka et al. [24] employed the object-oriented design tool PowerDesign to model a college learners' admission score management system based on the jQuery architecture, and realized system functions such as the management of basic learner data, inquiry of enrolment information, enrolment business process management, freshman enrolment process management, and class management.
To update the traditional performance data storage mode, Bellisario and Pavlov [25] used the IoC (Inversion of Control) container of the Spring framework to develop a college entrance examination score management system; to reduce the degree of coupling between functional modules, they designed the system with distributed storage on the Hadoop platform, which met the requirements of storing and processing massive college entrance examination score data. The college learner performance management system constructed by Winter et al. [26] can predict learner performance, analyse the association rules of the key factors influencing learners' test results based on the Apriori algorithm, and classify learner performance improvement methods based on the decision tree algorithm. The test results of online education not only reflect the knowledge mastery process, degree, and progress of learners, but also contain important information about the teaching quality of teachers and the evaluation of online education modes. The Hadoop-based test results management system constructed by Sonmez and Pintelon [27] allows online learners to query their test score ranking and daily performance; at the same time, based on an analysis of association rules between the scores of various subjects, it can recommend a learning trajectory with learning focuses. Based on data warehouse, association rule analysis, and OLAP technology, Ecem Yildiz et al. [28] extracted the feature attributes of the data of online learner test results management systems, and constructed an online test results prediction model and a classification model based on an improved PSO-K-Means algorithm.
Nowadays, data mining techniques such as association rule analysis, regression analysis, and clustering are widely used in the education field. However, existing data mining and management methods for learner test results mostly target learners' final grades and ignore their daily grades; as a result, they cannot mine the trajectory of the learners' learning process and test result changes. This paper innovatively applies trajectory data mining technology to the design of an English online test results management system. The main content of the paper is organized as follows: Chapter 2 analyses the functional requirements of the system, and constructs 4 basic information tables based on SQL Server 2005, including the teacher and student information, the course information, the process test results of individual subjects, the course test results, and other relation modes. Chapter 3 combines the improved k-means clustering algorithm with the trajectory frequent pattern mining algorithm to cluster the test results and analyse the trajectory deviations of learners. Chapter 4 details four system functions, including login, entry of test results, learning trajectory setting, and deviation analysis. Chapter 5 verifies the effectiveness of the proposed algorithm with experimental results, and examines the performance of the system in terms of two indicators: the query response delay and the data operation delay.

Function Requirement Analysis and System Outline Design
To build a reasonable process test results management system for online English tests, we first need to analyse the function requirements of the system based on interview and survey results. Figure 1 shows the function requirement framework of the designed system. The main function of the system is to manage the process test results of online English learners, so that learners can easily query their own process test results and obtain an analysis of the degree of deviation of their learning trajectory. The process tests here refer to the staged learning assessment and examination of learners. When teachers or instructors evaluate the staged learning of learners on a certain course, they need to refer to the staged learning objectives set for the learners at the beginning, that is, to check whether the process test objectives have been achieved. If these objectives have not been achieved, or if the learning trajectory deviation analysis result is not satisfactory, then the learning trajectory needs to be re-planned. According to the system function requirements shown in Figure 1, the relationship among system users can be one-to-one, one-to-many, or many-to-many. For one learner and one course, different learning items might be chosen because of different learning trajectories, and the learner might establish teacher-student relationships with different teachers. Figure 2 shows the relationships among teachers, students, and learning items under the process test mode. The relationship modes include students, teachers, courses, learning items, process test results of individual subjects, course test results, learning trajectory, learning trajectory deviation analysis, and others. For these relationship modes, the following basic information tables of the designed system were established in SQL Server 2005. Table 1 gives the basic information table of teachers and students.

Cluster analysis
Suppose a data set G = {G_1, G_2, …, G_X} contains X sets of learner process test result data composed of Y-dimensional time series eigenvectors; g_yx represents the eigenvalue of the y-th time series feature of the x-th set of data, and G_x = (g_1x, g_2x, …, g_Yx) represents the time series eigenvector of the x-th set. This paper chose the classic k-means clustering algorithm to divide the X sets of test result data into K mutually disjoint clusters M = {M_1, M_2, …, M_K}, and the cluster centres of the K clusters are denoted as δ_k = (δ_k1, δ_k2, …, δ_kY). Test result data within a cluster are highly similar, and test result data in different clusters are of low similarity. DIS(g_yx, δ_ky) is the distance between the test result data in a cluster and the centre of that cluster; it can be used to define the objective function of the clustering, as shown in Formula 1, where P is an X×K matrix that describes cluster membership. ρ_xk is a binary function with a value of 0 or 1: a value of 0 means that the x-th set of data does not belong to the k-th cluster, a value of 1 means that it does, and ρ_xk satisfies ∑_{k=1}^{K} ρ_xk = 1. The value of the distance DIS(g_yx, δ_ky) is greatly affected by the distance measurement method. If the Euclidean distance is adopted, Formula 1 becomes Formula 2. The processing of eigenvectors in Formula 2 ignores the differences between features, and this problem can be solved by assigning feature weights. This paper assigned weights to the time series eigenvectors from the subjective and objective perspectives; A represents the subjective weight and B represents the objective weight. Taking the subjective weight assignment as an example, let the subjective weights of the Y features of the time series eigenvector be A = (a_1, a_2, …, a_Y) and let the learner-defined parameter be ε; Formula 2 then becomes Formula 3, where a_y is also a binary function with a value of 0 or 1.
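The displayed forms of Formulas 1 to 3 do not appear in this text. As a hedged reading of the definitions above, and not the authors' typeset equations, they would plausibly take the following shape. The function name F and the placement of ε as an exponent on a_y are assumptions, the latter borrowed from the common feature-weighted k-means formulation.

```latex
% Formula 1 (assumed form): generic clustering objective
F(P,\delta) = \sum_{k=1}^{K}\sum_{x=1}^{X} \rho_{xk} \sum_{y=1}^{Y} \mathrm{DIS}(g_{yx},\delta_{ky}),
\qquad \sum_{k=1}^{K}\rho_{xk} = 1

% Formula 2 (assumed form): squared Euclidean distance
F(P,\delta) = \sum_{k=1}^{K}\sum_{x=1}^{X} \rho_{xk} \sum_{y=1}^{Y} (g_{yx}-\delta_{ky})^{2}

% Formula 3 (assumed form): subjective feature weights a_y with parameter \varepsilon
F(P,\delta,A) = \sum_{k=1}^{K}\sum_{x=1}^{X} \rho_{xk} \sum_{y=1}^{Y} a_{y}^{\varepsilon}\,(g_{yx}-\delta_{ky})^{2}
```

Under this reading, each step from Formula 1 to Formula 3 only changes the per-feature term inside the innermost sum, which is consistent with the text describing each formula as an update of the previous one.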
Suppose VA_y is the sum of the variance of feature y within the clusters, which can be calculated by Formula 4. Then, under the constraint ∑_{y=1}^{Y} a_y = 1, the objective function minimization problem is solved, and the feasible solution obtained is given by Formula 5. Similarly, the feasible solution corresponding to the objective weight is given by Formula 6. The comprehensive weight W = (w_1, w_2, …, w_Y) of the two weight assignment methods is given by Formula 7.
The execution steps of the clustering algorithm for learner process test results are:
Step 1: Normalize the data of learners' process test results;
Step 2: Initialize the feature weights of the time series eigenvectors so that the subjective weights and objective weights of all features are the same;
Step 3: Randomly initialize the cluster centres of the K clusters;
Step 4: According to the current subjective weights, objective weights, and cluster centres, use the Euclidean distance to update the division into K clusters;
Step 5: Based on the current subjective weights, objective weights, and the newly divided clusters, calculate the mean of the time series eigenvectors and update the cluster centres;
Step 6: Based on the updated clusters and cluster centres, update the subjective weights and objective weights;
Step 7: Check whether the algorithm has converged; if it has, stop, otherwise return to Step 4.
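The seven steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it keeps a single feature-weight vector instead of separate subjective and objective weights (Formulas 5 to 7 are not reproduced in this text), and it uses the weight update rule of the widely known feature-weighted k-means variant. The names `weighted_kmeans` and `eps` are hypothetical.

```python
import numpy as np

def weighted_kmeans(data, k, eps=2.0, max_iter=100, seed=0):
    """Feature-weighted k-means sketch; eps must be greater than 1.

    ASSUMPTION: one weight vector stands in for the paper's combined
    subjective/objective weights, with an assumed update rule.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n, y = data.shape
    # Step 1: min-max normalise every feature to [0, 1]
    span = data.max(axis=0) - data.min(axis=0)
    span[span == 0] = 1.0
    z = (data - data.min(axis=0)) / span
    # Step 2: start with equal feature weights
    w = np.full(y, 1.0 / y)
    # Step 3: pick k distinct data points as initial centres
    centres = z[rng.choice(n, size=k, replace=False)]
    labels = np.full(n, -1)
    for _ in range(max_iter):
        # Step 4: assign each point by weighted squared Euclidean distance
        diff = z[:, None, :] - centres[None, :, :]
        dist = ((diff ** 2) * (w ** eps)).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        # Step 7: stop once the assignment no longer changes
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 5: move each non-empty cluster centre to its mean
        for j in range(k):
            members = z[labels == j]
            if len(members):
                centres[j] = members.mean(axis=0)
        # Step 6: update weights from the within-cluster variance per feature
        va = np.maximum(((z - centres[labels]) ** 2).sum(axis=0), 1e-12)
        ratio = (va[:, None] / va[None, :]) ** (1.0 / (eps - 1.0))
        w = 1.0 / ratio.sum(axis=1)
    return labels, centres, w
```

With this update rule the weights always sum to 1, and features with a smaller within-cluster variance receive a larger weight, which matches the stated purpose of distinguishing between features.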

Analysis of the frequent pattern of learning trajectory
The frequent pattern mining algorithm proposed for learners' English learning trajectories, which is based on process tests, has the advantage of low cost. The sequence constructed by this algorithm is composed of item sets whose elements are items with time series relationships. A sequence can contain one or more item sets, and an item set one or more items; item sets are ordered, whereas the items within an item set are unordered. For the time series eigenvector sequences of learners' process test result data G_t = {g_1^t, g_2^t, …, g_x^t} and G_{t+1} = {g_1^{t+1}, g_2^{t+1}, …, g_y^{t+1}}, if G_t is a sub-sequence of G_{t+1} (and G_{t+1} a super-sequence of G_t), the learner studies more items at the next time moment and obtains more process test results. In this case, there is a number sequence 1 ≤ q_1 ≤ q_2 ≤ … ≤ q_x ≤ y that satisfies Formula 8. Item set G_{t+1} then contains all item sets that describe the process test results in G_t, and the order of the item sets in G_t and G_{t+1} is the same.
If G_t is the prefix of G_{t+1}, then at the current moment the learner continues the learning mode of the previous moment and the process test results remain unchanged; in this case, Formula 9 needs to be satisfied. The prefix relationship between G_t and G_{t+1} means that the sub-sequence G_t can only start from the start position of the super-sequence G_{t+1}; this mostly appears in the early learning stage of learners.
Suppose that the largest sub-sequence of G_t that satisfies Formula 8 is G_t′, and that G_{t′−1} is the prefix of G_t′. Then G_t′ is called the projection of G_t with respect to G_{t′−1}; it indicates that the learner changes the learning pattern at time t′, studies multiple advanced items, and obtains more process test results. This paper set a support degree threshold Support_min for the frequent pattern of the learning trajectory; if the degree of support of the pattern to be verified is lower than Support_min, then it cannot be output as an ideal frequent sequence pattern.
The execution steps of the frequent pattern mining algorithm for learners' English learning trajectories are:
Step 1: Normalize the learner process test result data set G, and determine the support degree threshold Support_min;
Step 2: Initialize the sequence trajectory pattern η and its length l, and obtain the learner process test result data G|η after the projection of η;
Step 3: Scan the time series eigenvector sequences of the data after the projection of η; if there is one item or only one item u of the last item set of η, then u is regarded as the required frequent item;
Step 4: Add the frequent item u to the end of η, and generate and output the new sequence trajectory pattern η*;
Step 5: Obtain the learner process test result data G|η* after the projection of η*; finally, call the algorithm program PrefixSpan(η*, l+1, G|η*).
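The recursive structure of these steps can be illustrated with a compact PrefixSpan-style sketch. It is a simplification under stated assumptions: each element of a trajectory is a single learning item rather than an item set, support is counted as the number of trajectories containing the pattern, and normalization (Step 1) is omitted. The function name `prefixspan` and the sample item names are illustrative only.

```python
from collections import Counter

def prefixspan(sequences, min_support):
    """Return {pattern: support} for sequences of single items.

    ASSUMPTION: item sets are reduced to single items, so only the
    'append a new element' extension of the prefix is handled.
    """
    results = {}

    def mine(prefix, projected):
        # Step 3: scan the projected data, counting each item once per sequence
        counts = Counter()
        for seq in projected:
            counts.update(set(seq))
        for item, support in sorted(counts.items()):
            if support < min_support:
                continue  # below the support threshold, not frequent
            # Step 4: append the frequent item to the prefix and output it
            pattern = prefix + (item,)
            results[pattern] = support
            # Step 5: project on the new pattern (suffix after the first match)
            new_projected = [
                seq[seq.index(item) + 1:] for seq in projected if item in seq
            ]
            mine(pattern, new_projected)

    # Step 2: start from the empty pattern and the full data set
    mine((), [list(s) for s in sequences])
    return results

trajectories = [
    ["listening", "reading", "grammar"],
    ["listening", "grammar"],
    ["reading", "listening", "grammar"],
]
patterns = prefixspan(trajectories, min_support=2)
```

For the three sample trajectories and a support threshold of 2, the sketch yields five patterns, among them ("listening", "grammar") with support 3 and ("reading", "grammar") with support 2.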
Analysis of learning trajectory deviation
Based on the processing of the learners' English learning trajectories in the previous section, the corresponding trajectory frequent pattern was obtained. The test result trajectory can also be taken as one of the features for the cluster analysis of learners' English test results. However, unlike the data processed by traditional clustering algorithms, test result data with time series features need a certain conversion before the distance between learning trajectory features, described by similarity, can be obtained. This conversion requires analysing the learning trajectory features and constructing learning trajectory frequent matching patterns for the different learning stages of learners.
Suppose the control parameter σ is a binary function with a value of 0 or 1. The two given trajectory patterns can be expressed by Formula 10. For the learning trajectory patterns of different learning stages, if there is a trajectory match TM = {LP_1, LP_2, …, LP_d} with a length of d, then the following three conditions need to be met:
a) When d is equal to 1, if LP_1, LP_1x, and LP_2y are equal, then a trajectory match TM = {LP_1} is considered to exist.
b) If there are r, e ∈ [1, d] such that LP_r, LP_1x, and LP_2y are equal and, at the same time, LP_e, LP_1b, and LP_2c are equal, then r < e implies x < b and y < c.
c) If r ∈ [1, d−1], then the following must hold (Formula 11).
The learning trajectory pattern T of a certain learning stage can be divided into Q mutually independent sub-trajectory patterns T = (QT_1, QT_2, …, QT_Q). According to the constraint conditions, the trajectory matches between the Q sub-trajectory patterns could be obtained, and the corresponding trajectory match set S = {S_1, S_2, …, S_K} could be constructed, wherein S_k = {S_k-DIS_r-DIS_e}. Suppose the number of calculations in which a sub-trajectory pattern satisfies the trajectory matching is S_k* = {S_k-DIS_r-DIS_e} and the total number of trajectory matching calculations is Q(Q−1)/2; then for each S_k, by solving all S_k-DIS_r-DIS_e that satisfy Formula 12, we could get the frequent trajectory pattern under a match with a length of d.
The similarity between two trajectories is measured by comparing the proportion of test result data fluctuations that fall in similar fluctuation matches across different time periods. Suppose the original sub-sets of pattern matching are represented by TS_1 = {λ_1-θ-d} and TS_2 = {λ_2-θ-d}. If the time interval span is large, the matching results may contain multiple trajectory matches of different lengths. In this case, the similarity of two trajectory patterns is the combined similarity of the multiple trajectory matches of different lengths. Suppose the ID and length of a trajectory pattern match are θ and d, respectively; then the similarity S of the two trajectory patterns can be calculated by Formula 13, where the weight ratio of learning item ftw(i,j) can be calculated by Formula 14.
Figure 3 shows the evaluation of the clustering effect of the test result clustering algorithm used in the system. When K=5, the clustering effect of the algorithm is better. To further verify the effectiveness of the algorithm, this paper designed an experiment to compare the performance of the algorithm before and after optimization. Figure 8 shows the experimental results under different K values. Figure 4(a) shows the experimental results of the classic k-means clustering algorithm when K=5; Figures 4(b) and 4(c) respectively show the experimental results after introducing the subjective and objective weights when K=3 and K=6. According to the figures, the three clusters divided by the classic k-means clustering algorithm have similar distributions of learners who had achieved the process test objectives, and each cluster has a similar number of such learners, but no effective information related to the learning trajectory could be provided.

Fig. 3. Evaluation of clustering effect
After the combined subjective and objective weights had been introduced and the experiments had been repeated many times, comparisons showed that, for the 9 learning items (English listening, oral English, dictation training, situational conversation, bilingual reading, vocabulary grammar, special program for domestic examinations, special program for foreign examinations, and special program for basic English), the clustering results under K=3 were obtained after the subjective weights of the relevant features were added; then, according to the features of the sub-items of the learning items, each cluster was sub-divided again. Cluster 0 was sub-divided into two clusters, Cluster 1 and Cluster 2; Cluster 1 was sub-divided into two clusters, Cluster 0 and Cluster 3; Cluster 2 was sub-divided into two clusters, Cluster 3 and Cluster 4; it can be seen from the figures that good results were obtained after the sub-division operations. Table 4 summarizes the access frequency of learning items in all learning trajectories in each cluster.
Different learners have different learning objectives. If a certain learning item occupies too high a proportion of the learning trajectory, it causes problems for the learning trajectory frequent matching threshold. Some representative patterns might not seem typical enough because certain learning items are accessed too frequently, and might be excluded by the minimum support degree. Table 2 shows the frequency of each learning item in the learning trajectory patterns of learners in each cluster when the trajectory matching length is 1. After the learning trajectory patterns were formed, the frequent matching pattern of individual learners could be obtained. The similarity threshold was fixed at 30%; that is, if the proportion of the number of matches of a pattern in the total number of matches exceeds this value, then this pattern is considered to be the learning trajectory frequent matching pattern of the learner. Table 3 gives examples of learning trajectory matching. This paper computed the similarity between the learning trajectory of each learner and the trajectory of the centre of the cluster to which the learner belongs. Table 4 shows the proportions of learners of each type with a similarity degree less than 30%, 20%, and 10%. Compared with the entire cluster, the learning trajectory deviation of these learners was relatively high, and there might be changes in their learning patterns; a lower degree of similarity indicates a wider fluctuation range of the process test results.
The performance test of the proposed online English test results management system was mainly carried out on two parameters: the query response delay and the data operation delay during system operation. Figure 5 shows the test results of system delay.
Figure 9(a) shows the time that learners or teachers wait for the results after submitting a test result query request, which is a direct manifestation of the effect of data mining applied to the test result management system. It can be seen from the figure that, with 100 to 500 target process test result records and 30 or 60 query records displayed at a time, the query response delay of the system was about 3 seconds, which can satisfy the basic query requirements. Figure 9(b) shows that the delays of the input, edit, and delete operations on process test result data were kept below 1.2 seconds, which can meet the basic requirements as well.
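The two threshold rules used in the deviation analysis, the 30% frequent-match rule and the proportions of learners below the 30%, 20%, and 10% similarity levels, amount to simple counting. The sketch below shows that arithmetic on made-up numbers; the function names and the sample values are hypothetical, and the similarity values themselves are assumed to have been computed beforehand by Formula 13.

```python
def frequent_matching_patterns(match_counts, min_share=0.30):
    """Patterns whose share of all matches exceeds min_share (30% rule)."""
    total = sum(match_counts.values())
    if total == 0:
        return []
    return [p for p, c in match_counts.items() if c / total > min_share]

def share_below(similarities, thresholds=(0.30, 0.20, 0.10)):
    """Proportion of learners whose similarity to the cluster centre
    falls below each threshold."""
    n = len(similarities)
    return {
        t: (sum(s < t for s in similarities) / n if n else 0.0)
        for t in thresholds
    }

# Hypothetical match counts for one learner and similarities for one cluster
frequent = frequent_matching_patterns({"A": 5, "B": 4, "C": 1})
deviating = share_below([0.05, 0.15, 0.25, 0.5])
```

Here patterns A and B each exceed 30% of the 10 matches, and three of the four sample learners fall below the 30% similarity level, which would mark them as candidates for re-planning the learning trajectory.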

Conclusion
This paper innovatively applied trajectory data mining technology to an English online process test results management system. First, the paper used SQL Server 2005 to construct 4 basic information tables based on an analysis of the functional requirements of the system. Then, an improved k-means clustering algorithm and a trajectory frequent pattern mining algorithm were employed to cluster the learners' process test results and analyse the degree of deviation of the learning trajectory. After that, this paper designed experiments to evaluate the clustering effect of the improved clustering algorithm and to compare the performance of the algorithm before and after optimization; the experimental results verified the effectiveness of the improved algorithm. Moreover, the experiments gave the trajectory matching results of the trajectory frequent pattern mining algorithm and the trajectory deviation analysis results, which demonstrated the application value of the proposed algorithm for early warning in English online education based on the fluctuation of process test results. Finally, the paper gave the detailed system function design from four aspects: login, process test result entry, learning trajectory setting, and test result query with learning trajectory deviation analysis. The experimental results also verified that the performance of the system was relatively good in terms of query response delay and data operation delay.