Analyzing Student Performance in Programming Education Using Classification Techniques

Abstract: In this research, we aggregated student log data, namely Class Test Score (CTS), Assignment Completed (ASC), Class Lab Work (CLW) and Class Attendance (CATT), from the Department of Mathematics, Computer Science Unit, Usmanu Danfodiyo University, Sokoto, Nigeria. We employed data mining techniques, specifically the ID3 and J48 decision tree algorithms, to analyze the data, and compared the two algorithms on 239 classification instances. The experimental results show that the J48 algorithm achieves higher classification accuracy than the ID3 algorithm. The Information Gain and Gain Ratio feature evaluators were also compared; both were applied with a rank search method. The experimental results confirmed that the two methods derived the same set of attributes with a slight deviation in the ranking. From the results analyzed, we discovered that 67.36 percent of the students failed the course titled “Introduction to Computer Programming”, while 32.64 percent passed. Since CATT has the highest gain value in our analysis, we concluded that it is largely responsible for the success or failure of the students. Recommendations are given on how to reduce the failure rate in the future.


Introduction
research area that tends to improve the learning outcome of students by developing methods to analyze and detect patterns and to infer changes, with the ultimate goal of improving learning. The application of data mining (DM) in educational settings gave rise to the field of Learning Analytics [2]. Recently, there has been exponential growth in research that harnesses data mining techniques in educational settings, giving rise to a field known as educational data mining [3]. Educational data mining refers to the process by which new techniques are developed to explore data emanating from educational settings; the findings are then used to understand the behavior of students and the environment they learn in [4]. The different techniques used for data mining can be employed for educational data mining (see Figure 1). Similarly, various classification algorithms are explored in data mining; Figure 2 depicts them. Educational data mining uses the same algorithms to find hidden information in datasets, and they may also be applied to predicting at-risk students and preventing dropouts. Studying computer programming as a course is both challenging and daunting, and the few privileged students who studied the course found it uninteresting after some time. The average success rate for the first introductory programming course, denoted CS1, has been estimated to be 67% worldwide [5]. To motivate students to succeed in and master a programming course, several researchers have become interested in factors that can make the teaching and learning of computer programming interesting. In particular, computer scientists have been researching ways to explore the various features, behavior and performance of students for the purpose of identifying weak students.
For instance, the use of mobile learning to support students in programming education has been explored by Oyelere et al. [6]. Similarly, a context-aware and adaptive system, called a smart learning environment for programming education [7], was proposed to enhance students' learning experience. The smart learning environment introduces a blend of formal and informal learning in which the learner can learn from any location, based on learning preference and context. Recently, researchers have shifted to a more data-driven approach by studying and analyzing the programming patterns and behavior of students. This includes patterns in programming and compilation states, which makes the approach more efficient and effective at reflecting the effort and progress of the students over the entire course duration.
This paper studies the performance of students in a programming course using data mining techniques. Data mining techniques are frequently used to study and analyze the performance of students in programming because of the rich set of tasks they provide. We employ the classification task in this research to evaluate and subsequently improve students' performance in programming. We also employ the decision tree method of data mining because of its high accuracy in predicting student performance [8].
The challenges that students encounter in a programming course have been topical and have raised concern among educators and researchers in recent times. Efforts to make programming easier to learn have been explored in different contexts. For instance, games and gamification [9], puzzle-based techniques [10], and other pedagogical approaches have been explored to give students a better learning experience. In the context of Nigeria, the use of the mobile learning system MobileEdu-puzzle [11] showed that the learning experience of students in an introductory programming class was enhanced. Although these efforts exist, it is important to further analyze students' performance in a programming course, using variables that are capable of impacting their grades. Important information such as student attendance, continuous assessment, class quizzes, and marks is useful for such analysis. In order to achieve the objectives of the study, we considered two research questions.
RQ1: What fine-grained programming log data should be aggregated to study their effect on students' performance?
RQ2: How will the performance of students be analyzed?
This research aimed at mining educational data to analyze students' performance in the introductory programming course in the Nigerian context. The following are the objectives of this study:
- Aggregate fine-grained programming log data of students in order to study their effect on the performance of students
- Employ data mining techniques such as the decision tree algorithm to analyze students' performance
- Compare the results of the ID3 and J48 algorithms in classifying the instances

Related Work
Data mining can be defined, according to [12], as the process of extracting implicit, previously unknown and useful information from data. Data mining is often used to find structural patterns in data and forms a strong basis for making predictions. Educational data mining deals with the methods and techniques for extracting knowledge from educational data. Although this research area is nascent, it is gaining popularity by the day, and there is much ongoing research in the area because of its potential in educational institutions. A survey of educational data mining covering 1995 to 2005, presented by [2], established the importance of mining educational data and encouraged researchers to explore this nascent field. They concluded by identifying some specific requirements not present in other domains. A case study presented by [13] demonstrated the importance of mining educational data in higher education, especially for improving graduate student results. In their research, the data set from the College of Science and Technology in Khanyounis from 1993 to 2007 was used. Knowledge discovery was achieved through the application of data mining techniques. In particular, they discovered association rules, which were then sorted using the lift metric. Data mining classification methods such as Naive Bayes and rule induction were used to predict the performance of graduate students. A recent survey on mining educational data carried out by [14] proposed a taxonomy of tasks in educational data mining and grouped similar applications into categories and sub-categories. They concluded their research by reviewing existing surveys and books about educational data mining.
In a study titled "Modeling student performance using data mining", the authors of [15] developed software for mining educational data to improve the rate at which students succeed, using student profiling. The application made use of data generated within the university domain. According to their findings, the success of the system was slightly distorted because the data expected in some columns were missing. They were, however, able to deduce that increasing the number of variables and the amount of data will lead to better predictions of the success rate of students.
A systematic review of the existing literature on predicting students' performance with data mining techniques and methods was conducted by [16]. The researchers highlighted the data mining methodologies used for the prediction of student performance. More importantly, they focused on the algorithm and how it can be used to discover the most important feature in student data. The classification task was used to evaluate the performance of students in [17]. In particular, the authors used the decision tree method of data mining, mainly because of its popularity and simplicity. By using this method, they were able to extract hidden knowledge that describes the performance of students in the final semester examination. The research helped in identifying students likely to drop out and, more importantly, those students in need of special attention and intervention such as advising or counseling.

Research Design
In this research, we employ a data mining process that includes data preparation, preprocessing, data selection, data transformation, data mining and evaluation (Fig. 3).

Data preparation
We extracted our data from the first-semester lectures of the computer programming course (CSC 201), which spanned three months from September to November 2017 at Usmanu Danfodiyo University, Department of Mathematics, Computer Science Unit. The course, titled "Introduction to computer programming", teaches students the introductory aspects of programming, including its syntax and semantics amongst other salient features. Computer Science majors are required to take the course, while it is optional for some departments. The course material consists of three parts, with each part corresponding to one month of the course. Two hours of weekly lectures were provided by the teaching staff (mainly Lecturer II), together with 20 hours of weekly support in the computer laboratory. The course outline contains, for example, an overview of Java programming, data types, variables and arrays, operators, control statements, Java classes, inheritance, packages and interfaces. Two example programming tasks and one version of the solution are presented in Table I. At the end of each semester, the students were graded as follows: continuous assessment (20% of the total score), assignments (5% of the total score), student attendance (5% of the total score) and exams (70% of the total score). To pass a course, a student must obtain at least 40% of the total score. The 239 students who sat the final exams provided the data for this study.
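The grading scheme described above amounts to a weighted sum with a pass mark of 40%. The following is a minimal sketch of that arithmetic, not the department's actual grading procedure. The function names, dictionary keys and input convention (each component already expressed as marks out of its maximum weight) are assumptions made for illustration.

```python
# Illustrative sketch only: names and input convention are assumptions,
# not taken from the course records.
WEIGHTS = {
    "continuous_assessment": 20,  # 20% of the total score
    "assignments": 5,             # 5% of the total score
    "attendance": 5,              # 5% of the total score
    "exam": 70,                   # 70% of the total score
}
PASS_MARK = 40  # a student needs at least 40% of the total score to pass


def total_score(marks):
    """Sum component marks after checking each lies within its maximum weight."""
    total = 0.0
    for component, maximum in WEIGHTS.items():
        value = marks.get(component, 0.0)
        if not 0 <= value <= maximum:
            raise ValueError(f"{component} must be between 0 and {maximum}")
        total += value
    return total


def has_passed(marks):
    """Return True when the total score reaches the pass mark."""
    return total_score(marks) >= PASS_MARK
```

For example, a hypothetical student with 12 marks for continuous assessment, 4 for assignments, 5 for attendance and 30 in the exam would total 51 and pass.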

Selection and transformation of data
Here, we selected the fields required for data mining. Table II gives an overview of all the response variables as well as the predictor variable for reference.

Using decision tree algorithm
Because of its powerful features, this algorithm is widely used for classification and prediction in both machine learning and data mining. One advantage of this method is that, in contrast to neural networks, a decision tree represents rules. Humans can readily understand and interpret these rules because of their simplicity and comprehensibility, which makes it possible to uncover the structure of large or small data sets and make predictions from them [18].
A decision tree is a flowchart-like classifier, organized like the tree data structure, where:
- A test on an attribute is denoted by a non-leaf node
- An outcome of the test is represented by a tree branch
- A value of the target attribute is indicated by a terminal node
- The topmost node in a tree is the root node
We chose the decision tree algorithm because of the following strong features:
- High-dimensional data can be handled easily with a decision tree
- Small-sized trees can easily be interpreted
- The learning and classification steps of decision tree induction are fast
ID3 decision tree: To build our decision tree, we use an algorithm developed by [19] known as ID3, which has been the primary algorithm from which decision trees are constructed. This algorithm uses a top-down, greedy search through the space of possible branches with no provision for backtracking. A tree is constructed based on the information gain obtained from the training instances, and is then used to classify the test data [20].
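The flowchart structure described above can be sketched as a small data structure. Internal nodes test an attribute, branches carry the test outcomes, and leaves hold a value of the target attribute. The attribute values ("high", "low", "yes", "no") and the toy tree below are hypothetical, chosen only to illustrate traversal. They are not the tree learned in this study.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    attribute: Optional[str] = None   # attribute tested at a non-leaf node
    label: Optional[str] = None       # target value stored at a terminal node
    branches: Dict[str, "Node"] = field(default_factory=dict)  # outcome -> child

    def is_leaf(self):
        return self.label is not None


def classify(node, instance):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while not node.is_leaf():
        value = instance[node.attribute]
        node = node.branches[value]
    return node.label


# Hypothetical tree: the root tests CATT, one subtree tests ASC.
toy_tree = Node(attribute="CATT", branches={
    "low": Node(label="fail"),
    "high": Node(attribute="ASC", branches={
        "yes": Node(label="pass"),
        "no": Node(label="fail"),
    }),
})
```

Calling `classify(toy_tree, {"CATT": "high", "ASC": "yes"})` walks from the root through the "high" branch to the ASC test and then to a leaf.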

Attribute selection measures
This measure is responsible for determining the procedure to be followed in splitting the tuples at a given node. It also provides a ranking for every attribute that is involved in the description of the training tuples. To choose the splitting attributes for particular tuples, we consider the attributes with the highest score.
- Information gain: This attribute selection measure is frequently used to select an attribute at each step while building the tree. To measure the homogeneity of a sample, the ID3 algorithm uses entropy. The entropy of a completely homogeneous sample is zero; for a binary sample that is equally divided, the entropy is one. We define the entropy of a set S (for binary classification, S contains only positive and negative examples) as:

$$\mathrm{Entropy}(S) = -\sum_{i} p_i \log_2 p_i \qquad (1)$$

In this case, $p_i$ is the proportion of S belonging to class $i$. In this paper, we define $0 \log 0$ to be 0 in all the calculations involving entropy.

The reduction in entropy caused by partitioning the examples according to an attribute A is measured using information gain:

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \qquad (2)$$

that is, the information needed before splitting minus the information needed after splitting. In this case, $S_v$ is the subset of S for which attribute A has value v (i.e., $S_v = \{s \in S \mid A(s) = v\}$), and Values(A) comprises all possible values of attribute A.

- Gain ratio: C4.5 uses the gain ratio, which normalizes the information gain by the split information obtained when the training data set S is split into n partitions $S_1, \dots, S_n$ by attribute A, defined as:

$$\mathrm{SplitInfo}(S, A) = -\sum_{i=1}^{n} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} \qquad (3)$$

The value above represents the information that corresponds to the n outcomes of a test on attribute A. The gain ratio is defined thus:

$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)} \qquad (4)$$

The attribute with the highest gain ratio is selected as the splitting attribute [9]. Relevant attributes are those that appear in the non-leaf nodes of the decision tree. We therefore state the algorithm of our decision tree as follows:
i) Select the attribute that best differentiates the output attribute.
ii) Create a separate tree branch for each value of the selected attribute.
iii) Divide the instances into subgroups that reflect the attribute values of the selected node.
iv) For each subgroup, terminate the attribute selection process if:
a) All members of the subgroup have the same value for the output attribute. Terminate the attribute selection for the current path and label the branch on the current path with that value.
b) The subgroup contains a single node, or no further distinguishing attributes can be determined. Label the branch with the output value seen by the majority of the remaining instances, as in (a).
v) Repeat the above process for every subgroup created in (iii) that is not terminal.
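The attribute selection measures and the induction steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Weka implementation used in this study. It assumes nominal attributes stored as dictionaries, uses information gain for step (i), and runs on four hypothetical records. All function names and the toy data are ours.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits; absent classes never enter the sum, so 0*log(0) is never evaluated."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def partition(rows, attribute):
    """Group rows into subsets S_v, one per observed value v of the attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row)
    return groups


def information_gain(rows, attribute, target):
    """Entropy before the split minus the weighted entropy after the split."""
    before = entropy([r[target] for r in rows])
    after = sum(len(g) / len(rows) * entropy([r[target] for r in g])
                for g in partition(rows, attribute).values())
    return before - after


def split_information(rows, attribute):
    """Entropy of the partition sizes induced by the attribute."""
    return entropy([r[attribute] for r in rows])


def gain_ratio(rows, attribute, target):
    split = split_information(rows, attribute)
    return information_gain(rows, attribute, target) / split if split > 0 else 0.0


def id3(rows, attributes, target):
    """Return a nested dict {attribute: {value: subtree_or_label}}."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:   # stopping rules (iv-a) and (iv-b)
        return majority
    best = max(attributes, key=lambda a: information_gain(rows, a, target))  # step (i)
    remaining = [a for a in attributes if a != best]
    return {best: {value: id3(group, remaining, target)   # steps (ii), (iii) and (v)
                   for value, group in partition(rows, best).items()}}


# Hypothetical records, for illustration only.
toy_rows = [
    {"CATT": "high", "ASC": "yes", "grade": "pass"},
    {"CATT": "high", "ASC": "no", "grade": "pass"},
    {"CATT": "low", "ASC": "yes", "grade": "fail"},
    {"CATT": "low", "ASC": "no", "grade": "fail"},
]
toy_tree = id3(toy_rows, ["CATT", "ASC"], "grade")
```

In the toy data, CATT separates the two grades perfectly while ASC carries no information, so the sketch places CATT at the root. Replacing `information_gain` with `gain_ratio` in step (i) gives the C4.5-style selection criterion, although C4.5 and J48 add further steps such as pruning that are not shown here.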

Results
We present the data set containing the results of n = 239 students in the programming course (CSC 201), obtained from the Department of Mathematics, Computer Science Unit, Usmanu Danfodiyo University (UDUS), Sokoto, Nigeria. The dataset used in this study covers the 2016/2017 academic session (Table III presents sample data and Table IV presents the frequency of occurrence of each grade). To decide which attribute to use as the root node of the tree, we mined the log data. We first calculated the entropy of S, treated here as the set of 239 instances, and then the information gain of each attribute A relative to S. The attribute with the highest information gain is the best choice for a given node. We therefore use CATT as the root node because it has the highest gain value.
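As a worked illustration of the entropy calculation, the sketch below collapses the grades into pass and fail. The counts 161 and 78 are our derivation from the reported 67.36% and 32.64% of 239 students. The entropy computed in the study itself may have been taken over the individual grade classes, so this is not a reproduction of the paper's tables.

```python
import math


def entropy_from_counts(counts):
    """Entropy in bits of a class distribution given as raw counts (0 log 0 treated as 0)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Assumption: counts derived from 67.36% failed and 32.64% passed out of 239.
failed, passed = 161, 78
h = entropy_from_counts([failed, passed])
print(f"Entropy of the pass/fail split: {h:.3f} bits")
```

Because the split is uneven, the value falls below the maximum of one bit that an equally divided binary sample would give.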

Fig. 4. Root node: CATT
We also selected attributes using the gain ratio, after the split information had been calculated. The split information is presented in Table VII and the gain ratio in Table VIII. As shown in Table IX, two different feature selection methods were applied: the Information Gain and Gain Ratio feature evaluators. Both methods were applied with a rank search method. The most influential features found were CATT, ASC, CTS, FSG and CLW. The experimental results confirmed that the two methods derived the same set of attributes with a slight deviation in ranking. The total number of instances considered for the classification task was 239. Table X compares the J48 and ID3 algorithm results. In the J48 classification, 208 instances were correctly classified and 31 were incorrectly classified, giving an accuracy of 87.02%. The mean absolute error was 0.0563 and the root mean squared error was 0.1779. The relative absolute error was 32.0058% and the root relative squared error was 60.4193%. In the ID3 classification, 204 instances were correctly classified and 35 were incorrectly classified, giving an accuracy of 85.35%. The mean absolute error was 0.0511 and the root mean squared error was 0.1874. The relative absolute error was 29.4579% and the root relative squared error was 64.3234%. The decision tree method makes use of attribute selection measures such as information gain and gain ratio, equations (2) and (4), which were discussed in section 3.4. Having established the criteria for splitting on the various attributes, we then used the Weka classifier [21] to create a tree visualization of the J48 classification result. This classifier prunes the tree, which reduces the overfitting associated with building a decision tree.
For instance, out of five distinct attributes, only the CATT and ASC attributes are shown; the remaining attributes, CTS, FSG and CLW, were pruned from the tree based on the information gain attribute selection measure. Attributes are represented by ellipses, whereas actual attribute values are represented in rectangular boxes together with the number of instances. The topmost node in Figure 6 is the root node and is labeled CATT. This node was automatically selected by the J48 classifier, followed by the ASC node.
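The accuracy figures reported for Table X follow from the counts of correctly and incorrectly classified instances. The short sketch below shows that arithmetic with function and variable names of our choosing. The printed values may differ in the last decimal place from the percentages quoted above, depending on how the original figures were rounded or truncated.

```python
def accuracy(correct, incorrect):
    """Fraction of instances classified correctly."""
    return correct / (correct + incorrect)


j48_accuracy = accuracy(208, 31)   # counts reported for J48
id3_accuracy = accuracy(204, 35)   # counts reported for ID3

print(f"J48: {100 * j48_accuracy:.4f}%")
print(f"ID3: {100 * id3_accuracy:.4f}%")
print(f"Difference: {100 * (j48_accuracy - id3_accuracy):.4f} percentage points")
```

Both classifiers were evaluated on the same 239 instances, so the gap between them corresponds to four instances.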

Discussion
This section discusses the results of the analysis of student performance. The analysis is based on the accuracy of the classification methods and on the main factors that may influence the performance of students. Table X shows the classification accuracy of both the ID3 and J48 classification methods. From the table, J48 has a higher accuracy than the ID3 technique by 1.6644%. Table IV shows that 67.36% of the students failed the course titled Introduction to Computer Programming. This is strongly connected with the avoidance of lectures by the students, as represented in Table VI. In this table, the Class Attendance (CATT) attribute has the highest information gain, making it a suitable candidate for the root node. The attribute with the next highest information gain in the same table is Assignment Completed (ASC). This was further corroborated in Figure 6, in which the Weka classifier automatically selected CATT as the root node of the tree, followed by ASC. One useful feature of this classifier is its ability to prune the tree automatically, which limits overfitting. We also represented the classification result in the form of if-then rules, as shown in Figure 5. One strong feature of the decision tree is its ability to represent information in the form of if-then rules [17]. The result clearly shows that class attendance plays an essential role in enhancing students' performance, as 75% attendance is compulsory in order to write the exams. However, a student may also write the exam with attendance below 75% for genuine reasons such as sickness or accident. Such reasons must, however, be backed by a valid document. Any student found wanting in this respect is awarded a fail (F) grade.
The reasons why students were absent from class are probably connected to the difficult nature of programming courses, since "novice programmers find it difficult to remember and correctly apply programming language vocabulary, logic, errors, syntax, semantics, and styles" [22], or to contextual reasons [23].
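The conversion of a decision tree into if-then rules mentioned in this section can be sketched as follows: each path from the root to a leaf becomes one rule. The nested-dictionary tree, its attribute values and the function name are hypothetical and serve only to show the mechanics. They do not reproduce the rules in Figure 5.

```python
def tree_to_rules(tree, conditions=()):
    """Turn a nested dict {attribute: {value: subtree_or_label}} into rule strings."""
    if not isinstance(tree, dict):  # leaf: emit one rule for the path taken so far
        premise = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {premise} THEN grade = {tree}"]
    rules = []
    for attribute, branches in tree.items():
        for value, subtree in branches.items():
            rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules


# Hypothetical tree with CATT at the root and ASC below it.
toy_tree = {"CATT": {"low": "fail", "high": {"ASC": {"yes": "pass", "no": "fail"}}}}
for rule in tree_to_rules(toy_tree):
    print(rule)
```

Each rule reads directly as a statement a lecturer could act on, which is the interpretability advantage attributed to decision trees above.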

Conclusion and Future Work
This paper shows the usefulness of data mining in analyzing the performance of undergraduate students, particularly in the domain of tertiary education. Data was gathered from the introductory programming course at Usmanu Danfodiyo University, Sokoto, Department of Mathematics, Computer Science Unit. Various data mining techniques were applied to discover hidden knowledge. In particular, we used the decision tree method to analyze the data set and discovered that Class Attendance (CATT) plays a major role in determining the success or failure of students. This study will be of benefit to both lecturers and students. In particular, struggling students can be easily identified through their class attendance rate.
Furthermore, intervention techniques such as counseling and advising will come in handy for students who feel they can pass the course without attending class. This will go a long way towards boosting the academic performance of students. Future work will look at the possible factors that lead to the massive failure of students in programming courses at Usmanu Danfodiyo University, Sokoto, Nigeria, with a view to ameliorating the problem. This research has, however, established that avoidance of lectures by the students is largely responsible for the massive failure in programming courses. We intend to explore this further in subsequent research to find out why students avoid programming classes. For example, we pose the following research question for future study: could the absence of students from programming classes be due to the daunting nature of programming or to the way it is taught? Answering this question will provide a clearer picture of how to minimize the failures.