An Improved Apriori Algorithm for Association Mining Between Physical Fitness Indices of College Students

The physical fitness of college students can be evaluated scientifically from physical education (PE) data. This paper first applies the Apriori algorithm to mine the hidden correlations between physical fitness indices in PE data on college students, and identifies the indices most closely associated with their physical fitness. The Apriori algorithm is then improved to reduce the time complexity of association rule mining. With the improved algorithm, the associations between several indices were found to exceed the minimum support of 0.2 and the minimum confidence of 0.7, reflecting their important impacts on physical fitness. Thus, the physical fitness of college students is significantly influenced by speed, endurance, flexibility, and vital capacity, but not greatly affected by height and weight. The research results provide an important guide for the test and curriculum designs of PE for college students.


Introduction
With the progress in data mining technology, there has been a steady growth in the volume of data on physical education (PE) of college students. This trend is expected to continue in the foreseeable future. The massive PE data suggest a continuous decline in physical fitness among college students in China [1,2], posing a threat to their learning performance and daily life.
Against this backdrop, it is important for every college to step up the monitoring and performance evaluation of PE. However, traditional data processing methods cannot effectively evaluate the physical fitness of college students, owing to the sheer volume of the relevant data. As a result, PE experts face the new task of mining the correlations between the fitness indices of college PE.
If the above task is solved effectively, it will be possible to evaluate the physical fitness of college students based on PE data. Efficient and effective evaluation of the PE data on college students helps colleges formulate a better PE curriculum to enhance the physical fitness of their students, and enables decision-makers to optimize their decisions concerning PE implementation.
To identify the key factors affecting the physical fitness of college students, this paper uses the Apriori association rule mining algorithm to mine the correlations between physical fitness indices from massive PE data, and identifies the most important indices of physical fitness. In addition, the Apriori algorithm is improved through the divide-and-conquer principle to reduce its time complexity. Finally, the indices identified by the improved algorithm are compared with those obtained by the original algorithm. The research findings provide a good reference for improving the physical fitness of college students.

Literature Review
In the field of education, many colleges rely on data mining to assist students in course selection, evaluate their development, innovation ability, and entrepreneurship, and predict their post-graduation growth.
Joksimović et al. [3] applied data mining to manage student achievements, laying a solid basis for improving teaching quality. Sun and Bin [4] mined hidden information that affects the computer grade examination results of a college, and provided an important guide for computer teaching. Jen et al. [5] improved the Apriori algorithm, and combined it with preference information to mine and analyze student scores. Kumar et al. [6] explored the correlations between academic performance and habits, and thereby gave early warning of abnormal student situations.
Through fuzzy k-means clustering, Zhen [7] captured deep knowledge from teaching data samples, and evaluated the academic performance of students. Ishman et al. [8] explored the cumulative data on teachers and student evaluations with the Apriori algorithm, and revealed the frequent problems in the teaching process. Drawing on the rough set theory of data mining, Tican and Taspinar [9] examined teacher behaviors and the teaching effect of experimental courses in colleges, and identified the factors that constrain the development of learners. Liu et al. [10] probed deep into the problems of current teaching evaluation methods, conducted an experimental analysis with an association rule algorithm, a rough set algorithm, etc., and constructed a multi-element teaching theory.
Based on data mining, Ji et al. [11] set up a personalized distance education system that provides services catering to the needs of each student. Luo et al. [12] derived an overall development model from the historical performance of students, and created a mining model to predict their future performance and give early warning of performance declines. With the aid of an association rule algorithm, Yu [13] acquired the data related to student learning, identified the association rules that influence learning effect, and pinpointed the factors strongly correlated with learning effect.
Based on the consumption data of campus smart cards, Liu et al. [14] adopted statistics and social network research methods to obtain the features of student communication behavior. Using an association rule algorithm, Scherer et al. [15] dug into the information on student achievements, quantified the association rules between courses, and forecast the number of students who could not graduate on schedule. Fan et al. [16] explored the techniques of association rule mining, proposed an optimized Apriori algorithm, and applied it to find the correlations between the main factors affecting student performance.
Aher and Lobo [17] combined clustering algorithm and association rule algorithm to mine and analyze course scores, and thus discover the factors affecting student performance. Kotsiantis et al. [18] improved the incremental mining algorithm based on the degree of learning interest, and then obtained reliable and reasonable association rules of course structure, providing a theoretical basis for course recommendation. Zhou et al. [19] presented a personalized course recommendation algorithm, and employed it to recommend computer courses; the results show that their algorithm can recommend courses as per the needs of different students, after analyzing the courses associated with the promotion of student scores.

Correlation Analysis Based on Apriori Algorithm
One of the latest trends in PE research is to mine the hidden relations between physical fitness indices from a large volume of PE data on college students. This paper first constructs and tests a correlation analysis model of PE data based on the Apriori algorithm.
The Apriori algorithm aims to mine association rules with the help of frequent itemsets [20-24]. The basic idea of the algorithm is as follows: first, find all frequent itemsets, i.e. those whose support is greater than or equal to the predefined minimum support; then, from each frequent itemset, generate candidate rules that contain only the items in that itemset; finally, retain only the strong association rules, i.e. those that satisfy both the minimum support and the user-defined minimum confidence. The specific flow of the Apriori algorithm is shown in Figure 1.
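The frequent-itemset stage described above can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: the function name `apriori`, the level-wise join and prune steps, and the toy transactions (whose labels are hypothetical) are assumptions made only for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: support} for every frequent itemset.

    Assumes `transactions` is a non-empty list of item collections.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of `itemset`.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        prev = list(current)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Scan the data again and keep candidates that meet the threshold.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy data: each transaction lists the indices a student passed.
toy = [
    {"endurance", "speed", "vital_capacity"},
    {"endurance", "speed"},
    {"endurance", "vital_capacity"},
    {"speed", "flexibility"},
    {"endurance", "speed", "vital_capacity"},
]
for itemset, s in sorted(apriori(toy, 0.4).items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(s, 2))
```

Each pass of the `while` loop rescans all transactions, which is the repeated traversal that makes the original algorithm slow on large datasets.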
Frequent itemsets refer to the itemsets with a support higher than the minimum support threshold. The mining of frequent itemsets is the cornerstone of data mining. Association rules can be mined from these itemsets.
Association rule mining, a rule-based machine learning approach, looks for relations of interest in a large database. Its purpose is to identify the strong rules in the database with the aid of measures such as support and confidence.
The strength of an association rule can be measured by its support and confidence. Support is the frequency with which an itemset or rule appears across all transactions. The support σ(A) of itemset A can be calculated by:

$$\sigma(A) = \frac{\mathrm{count}(A)}{N}$$

where count(A) is the number of transactions that contain A, and N is the total number of transactions. The support of rule A → B can be defined as:

$$\mathrm{support}(A \rightarrow B) = \sigma(A \cup B) = \frac{\mathrm{count}(A \cup B)}{N}$$

Thus, the support of A → B indicates the probability that itemsets A and B appear at the same time:

$$\mathrm{support}(A \rightarrow B) = P(A \cup B)$$

Confidence refers to the probability of itemset B appearing in a transaction T that contains itemset A:

$$\mathrm{confidence}(A \rightarrow B) = P(B \mid A) = \frac{\sigma(A \cup B)}{\sigma(A)}$$

Based on the physical indices and the principle of the Apriori algorithm, the minimum support and minimum confidence were set to 20% and 70%, respectively. After data pre-processing, the Apriori algorithm was adopted to handle the data. First, the candidate single-item sets were scanned, and those below the minimum support threshold were removed. The remaining items were combined into two-item candidate sets. Next, the transaction records were scanned again to delete the candidate sets below the minimum support threshold. These operations were repeated, with the itemset size growing by one in each round, until no further frequent itemsets could be generated.

Table 1 shows the correlations between the indices obtained by the model from the actual data. The following conclusions can be drawn from the table:
1. The support and confidence of the correlations between index 2 and indices 6 and 7 were greater than 0.2 and 0.7, respectively, indicating that weight is correlated with push-up test result and sit-up test result.
2. Index 2 is not significantly correlated with indices 8 or 9.
3. The support and confidence of the correlations between index 3 and indices 4-8 were above the minimums of 0.2 and 0.7, respectively.

The above results demonstrate the importance of endurance in PE, and show that speed, flexibility, and vital capacity are important indices of physical fitness.
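To make the support and confidence measures concrete, the following Python sketch evaluates a few rules against the thresholds of 0.2 and 0.7 used in this section. The helper names (`count`, `support`, `confidence`, `is_strong`) and the toy transactions are hypothetical and are not taken from the PE dataset.

```python
def count(itemset, transactions):
    """Number of transactions containing every item of `itemset` (a set or list)."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions)


def support(itemset, transactions):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from transaction counts."""
    base = count(antecedent, transactions)
    joint = count(set(antecedent) | set(consequent), transactions)
    return joint / base if base else 0.0


def is_strong(antecedent, consequent, transactions,
              min_support=0.2, min_confidence=0.7):
    """A rule is strong if it meets both thresholds."""
    rule_items = set(antecedent) | set(consequent)
    return (support(rule_items, transactions) >= min_support
            and confidence(antecedent, consequent, transactions) >= min_confidence)


# Hypothetical toy data, for illustration only.
toy = [
    {"endurance", "speed", "vital_capacity"},
    {"endurance", "speed"},
    {"endurance", "vital_capacity"},
    {"speed", "flexibility"},
    {"endurance", "speed", "vital_capacity"},
]

# endurance -> vital_capacity: support 3/5, confidence 3/4, so the rule is strong.
print(is_strong({"endurance"}, {"vital_capacity"}, toy))
# speed -> vital_capacity: support 2/5, confidence 2/4, so it fails on confidence.
print(is_strong({"speed"}, {"vital_capacity"}, toy))
```

Computing confidence from raw counts rather than from two support values avoids an extra floating-point division.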

Correlation Analysis Based on Improved Apriori Algorithm
Despite its good mining effect, the Apriori algorithm has a low efficiency. It consumes too much time to process a large dataset, owing to the need to traverse the data multiple times. To lower the time complexity, this section improves the Apriori algorithm, and applies the improved method to verify the factors affecting the physical fitness of college students.
The Apriori algorithm was improved by adopting the divide-and-conquer principle. Following this principle, the transaction dataset was recursively split into several small conditional transaction datasets, facilitating the mining of frequent itemsets. After the improvement, the algorithm only needs to traverse the transaction dataset twice, and no longer needs to generate candidate sets. In this way, it is about one order of magnitude faster than the original algorithm.
The improved algorithm first sorts the items in the transaction dataset by support, then inserts the items of each transaction, in descending order of support, into a frequent pattern (FP) tree that has null as its root node, and records the support at each node. Figure 2 provides an example of the FP-tree.

Fig. 2. An example of the FP-tree
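The two-scan construction of the FP-tree can be sketched in Python as follows. This is an illustrative sketch in the style of the well-known FP-growth approach, not the authors' code: the names `FPNode` and `build_fp_tree`, the use of an absolute `min_count` instead of a relative support, and the alphabetical tie-breaking between equally frequent items are all assumptions.

```python
from collections import Counter


class FPNode:
    """One node of the FP-tree: an item, its count, and its children."""

    def __init__(self, item, parent=None):
        self.item = item          # None for the root node
        self.count = 0
        self.parent = parent
        self.children = {}        # item -> FPNode


def build_fp_tree(transactions, min_count):
    """Build an FP-tree with two passes over `transactions`."""
    # First scan: count how many transactions contain each item.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    root = FPNode(None)
    # Second scan: insert each transaction's frequent items in
    # descending order of count (ties broken by item name).
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1      # shared prefixes accumulate counts
            node = child
    return root, frequent


def show(node, depth=0):
    """Print the tree with indentation, for inspection."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)


# Hypothetical toy transactions.
tree, item_counts = build_fp_tree(
    [["a", "b"], ["b", "c"], ["a", "b", "c"], ["b"]], min_count=2)
show(tree)
```

Because transactions that share a prefix share a path, the tree is usually much smaller than the raw dataset, which is what allows the later mining steps to avoid rescanning the data.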
The FP-tree was constructed and projected iteratively. For each frequent item, a conditional projection database and an FP-tree were constructed. This process was repeated for every newly constructed FP-tree, until the tree was empty or contained only one path. If the tree was empty, its prefix was taken as the frequent pattern; if the tree contained only one path, the frequent patterns were obtained by enumerating all possible combinations of the items on the path and connecting each with the prefix of the tree. Figure 3 gives the flow of the improved Apriori algorithm.

Fig. 3. Flow of the improved Apriori algorithm
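The single-path case of the flow above can be illustrated with a short Python sketch. It assumes the usual FP-growth convention that the count of a combination equals the smallest node count among its items; the function name `patterns_from_single_path` and the example path are hypothetical.

```python
from itertools import combinations


def patterns_from_single_path(path, prefix, min_count):
    """Enumerate frequent patterns from a single-path conditional FP-tree.

    path:   list of (item, count) pairs from the root down to the leaf
    prefix: items the conditional tree was built for
    """
    patterns = {}
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            # A combination occurs as often as its least frequent node.
            combo_count = min(c for _, c in combo)
            if combo_count >= min_count:
                items = frozenset(i for i, _ in combo) | frozenset(prefix)
                patterns[items] = combo_count
    return patterns


# Hypothetical conditional tree for prefix "c" with the single path b:3 -> a:2.
result = patterns_from_single_path([("b", 3), ("a", 2)], ("c",), min_count=2)
for items, cnt in result.items():
    print(sorted(items), cnt)
```

The prefix itself is already known to be frequent when its conditional tree is built, so only the combinations joined to it are generated here.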
Next, the improved Apriori algorithm was applied to filter the original data and analyze the correlations between physical fitness indices.
The correlations of the nine rules between all physical fitness indices are presented in Table 2. The following can be inferred from the data in this table:
1. Gender (index 1) was associated with most of the other indices. The support and confidence of the correlations between this index and the other indices were greater than 0.2 and 0.7, respectively. The association rules indicate that the correlations are not very strong.
2. The support and confidence of the correlations between index 2 and indices 1 and 3-7 were greater than 0.2 and 0.7, respectively. Thus, height is correlated with weight, vital capacity, push-up test result, sit-up test result, sit-and-reach test result, and standing long jump test result. The support between index 2 and index 8 was greater than 0.2, but the confidence between them was smaller than 0.7. This means height is not strongly correlated with 50m dash test result. Moreover, indices 2 and 9 have a certain correlation, suggesting that height is associated with 1,000m/800m running test result.
3. The support and confidence of the correlations between index 3 and indices 4 and 5 were greater than 0.2 and 0.7, respectively. Hence, weight is correlated with vital capacity and sit-and-reach test result. Meanwhile, the support and confidence of the correlations between index 3 and indices 7 and 8 were also greater than 0.2 and 0.7, respectively. This means weight is correlated with push-up test result and sit-up test result.

In association rule analysis, if the problem involves a large number of variables with complex relationships between them, principal component analysis can be used to simplify the dataset, which helps in analysing and modelling the problem.
Principal component analysis (PCA) is a data dimension reduction method based on sample statistics and covariance minimization theory. Its basic idea is to simplify the complex relationships between independent variables by transforming multiple variables into a few comprehensive variables through matrix decomposition. Each principal component is a linear combination of the original variables, and the principal components are uncorrelated with one another. Therefore, the principal components can reflect most of the information in the original variables without overlap.
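A minimal NumPy sketch of this idea is given below: it standardizes the data, eigendecomposes the correlation matrix, and projects the standardized data onto the leading eigenvectors. The function name `pca_scores` and the synthetic input are assumptions for illustration; the sketch also assumes that no column is constant, since a constant column cannot be standardized.

```python
import numpy as np


def pca_scores(X, n_components):
    """Return (scores, explained_ratio) from correlation-matrix PCA.

    X is an (n_samples, n_variables) array with no constant columns.
    """
    X = np.asarray(X, dtype=float)
    # Standardize every variable to zero mean and unit sample variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation coefficient matrix of the variables.
    R = np.corrcoef(Z, rowvar=False)
    # eigh suits symmetric matrices; it returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Each principal component is a linear combination of the
    # standardized variables, weighted by one eigenvector.
    scores = Z @ eigvecs[:, :n_components]
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    return scores, explained_ratio


# Synthetic data for illustration only (not the PE dataset).
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4))
data[:, 1] += 0.8 * data[:, 0]   # make two variables correlated
scores, ratio = pca_scores(data, n_components=2)
print(scores.shape, ratio)
```

The explained-variance ratios offer one common way to decide how many components to keep, for example by retaining components until a chosen share of the total variance is covered.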
Assume that there are T original independent variables. PCA extracts the information from all of them and integrates it into T new variables of differing importance. Suppose that each sample has T attribute values and that there are n samples in total. The data then form an n × T matrix:

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1T} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nT} \end{pmatrix}$$

After standardization, the correlation coefficient matrix of the data is:

$$R = (r_{ij})_{T \times T}$$

where $r_{ij}$ is the correlation coefficient between the i-th and j-th variables. The eigenvalues ($\lambda_1, \lambda_2, \dots, \lambda_T$) and the corresponding eigenvectors of the correlation coefficient matrix are calculated. The important principal components are chosen and their expressions are written. Finally, the standardized data are substituted into the principal component expressions to obtain the score of each principal component. The specific forms are as follows:

$$F_k = a_{k1} z_1 + a_{k2} z_2 + \cdots + a_{kT} z_T, \quad k = 1, 2, \dots, T$$

where $z_j$ is the j-th standardized variable and $(a_{k1}, a_{k2}, \dots, a_{kT})$ is the eigenvector corresponding to $\lambda_k$.

To sum up, the physical fitness of college students is significantly influenced by speed, endurance, flexibility, and vital capacity, but not greatly affected by height and weight.
Further, the improved Apriori algorithm was compared with the original algorithm in terms of performance. Figure 4 contrasts the execution time of the two algorithms under different minimum supports. It can be seen that the improved Apriori algorithm ran much faster than the traditional algorithm, which is consistent with its lower time complexity. Therefore, the proposed algorithm is more suitable for analysing the mass PE data on college students.

Conclusion
This paper analyses the PE data on college students with the original and improved Apriori algorithms, and mines the correlations between the physical fitness indices. The results show that the height and weight of college students in China are increasing, but their overall physical fitness is on the decline; that the physical fitness of college students is significantly influenced by speed, endurance, flexibility, and vital capacity, but not greatly affected by height and weight; and that the physical fitness of college students can be improved by providing scientific training on speed, endurance, flexibility, and vital capacity. The research results provide an important guide for PE teachers and designers of PE curricula.