Acknowledging iJOE 2017 Reviewers

Educational tools are among the prominent solutions for helping students learn course material in the Information Technology (IT) domain. However, most of them are not used in practice since they do not properly fit student needs. This paper evaluates the impact of an educational tool, namely PythonTutor, on completing programming laboratory tasks regarding data structure materials. The evaluation is conducted over one semester by implementing a quasi-experimental design. As a result, six findings can be deduced: 1) PythonTutor might positively affect student performance when the students have used the tool before; 2) student perspective regarding the impact of an educational tool is not always in sync with the actual laboratory result; 3) the impact of PythonTutor might be improved when a similar data representation is used consecutively for several weeks; 4) the correlation between the use of PythonTutor and student performance might not be significant when the control and intervened groups have completely different characteristics; 5) the students might experience some difficulties when they are asked to handle a big task for the first time; and 6) the students might be able to complete a particular weekly task with a promising result if they have understood the material well.
Keywords—quasi-experimental design, empirical evaluation, program visualization, educational tool, laboratory session


Introduction
Since students are among the most influential resources in a university, emerging issues in this domain have become research topics. These include tracing alumni [1], predicting student outcomes [2], detecting plagiarism among students' assignments [3,4], and aiding students to learn course material [5]. Among these issues, we would argue that the last one is the most urgent since it directly affects student learning performance.
For aiding students to learn course material, the use of educational tools has been shown to be effective, especially in the Information Technology (IT) domain, where most course materials involve abstract representation [5,6]. However, in most cases, such tools are not applicable in a real learning environment since they do not fit student needs [7]. We would argue that one of the main reasons for this mismatch is the lack of empirical evaluation in real learning environments. Consequently, this paper aims to narrow that gap by proposing an empirical evaluation of an educational tool. To be specific, we want to evaluate the impact of PythonTutor (i.e. an educational tool for learning programming) on completing data-structure laboratory tasks, from general and specific perspectives, with regard to student grades. The findings of this work are expected to provide brief insight for IT lecturers who plan to incorporate PythonTutor as a supplementary learning tool.
iJOE - Vol. 14, No. 2, 2018

Related Works
Although IT is a prominent major for undergraduate students nowadays, some IT students experience difficulties in learning IT materials, especially algorithms [8,9] and programming [10,11]. They feel that learning such materials is not a trivial task since most concepts are abstract and require strong logical thinking to understand. Consequently, several educational tools for learning both algorithm and programming materials have been developed. They are referred to as Algorithm Visualization (AV) and Program Visualization (PV) tools respectively. On the one hand, AV tools focus on providing a brief concept of how standard algorithms work without discussing the implementation [12]. They usually rely on interactive visuals and animation in order to keep the user's attention. VisuAlgo [12] and AP-ASD1 [13] are two examples which fall into this category. On the other hand, PV tools focus on visualizing and animating program aspects based on runtime execution [14]. They usually display all information stored in a program in a debug-like manner. Jeliot 3 [15], JIVE [16], VILLE [17], and PythonTutor [18] are several examples which fall into this category.
PythonTutor is a PV tool that was initially aimed at assisting students to learn programming with Python [18]. Unlike other PV tools, PythonTutor is designed as a web-based application with a responsive UI. It can be accessed from anywhere as long as the students are connected to the internet. In addition, it can be used on various devices such as a personal computer, laptop, tablet, or smartphone.
Given that several PV tools have been evaluated in real programming courses to measure their effectiveness comprehensively [19,5], this paper proposes a quasi-experimental design to evaluate the impact of PythonTutor on learning programming in laboratory sessions using students' grades. To our knowledge, it is the first attempt to discuss such impact under the given conditions. For our case study, students from 4 classes of the Basic Data Structure (BDS) course are considered as our participants. They are asked to use the tool for completing their laboratory tasks in half of the semester while working without it in the other half. Their laboratory results are then used to statistically evaluate PythonTutor's impact on completing programming laboratory tasks regarding data structure materials from general and specific perspectives.
It is important to note that our work is different from the works proposed in [20] and [21], which also evaluate the impact of PythonTutor. On the one hand, our work is different from the work proposed in [20] since our work is focused on laboratory sessions instead of theory sessions. Further, the work in [20] is focused on an introductory course while our work is focused on a basic data structure course. On the other hand, our work is different from the work proposed in [21] since our work evaluates the impact of PythonTutor through student performance rather than student perspective. Our work complements the work in [21] by providing a more objective result (the result of the work in [21] is rather subjective since it relies on a questionnaire survey of the students).
http://www.i-joe.org Short Paper-A Quasi-Experimental Design to Evaluate the Use of PythonTutor on Programming Labo…

Methodology
Evaluation will be conducted by performing a quasi-experimental design [22] in the 14-lecture-week laboratory sessions of the Basic Data Structure (BDS) course. The participants are first-year students who take the BDS course in the even semester of 2016/2017. They will be split into two groups before conducting the evaluation. Since there are four classes of the BDS course during that semester, we assign two classes to each group. Class A (15 students) and B (10 students) are assigned to the first group, which will act as the intervened group for even weeks, whereas class C (19 students) and D (18 students) are assigned to the second one, which will act as the intervened group for odd weeks. As a result, group 1 consists of 22 students while group 2 consists of 34 students.
One of the predefined groups will be assigned as the intervened group while the other will be assigned as the control group, alternating across the laboratory sessions. For odd-week laboratory sessions, group 1 will be assigned as the control group and group 2 as the intervened group. For even-week laboratory sessions, the roles are reversed: group 1 will be assigned as the intervened group and group 2 as the control group. If a group is assigned as the control group, students in that group should complete their laboratory task in a conventional manner (i.e. without using PythonTutor). Otherwise, students in that group may complete their laboratory task with the help of PythonTutor if necessary.
Each group will get similar laboratory tasks for each session; each task should be completed within 80 minutes. The details of each task, including its assigned intervened group, can be seen in Table 1. Most laboratory tasks are about implementing data structure concepts in the Python programming language to solve problems. Some weekly tasks are split into several smaller sub-tasks to reduce their difficulty. However, the total score for each weekly task is still 100, regardless of how many sub-tasks are involved in that week.
After the 14-lecture-week laboratory sessions have been conducted, students' laboratory grades will be collected. Those grades will be further analyzed to evaluate PythonTutor's impact on completing programming laboratory tasks regarding data structure materials from general and specific perspectives. On the one hand, for the general perspective, the impact is measured by comparing the result of intervened sessions with the un-intervened ones for the same group. This evaluation will be conducted based on a paired t-test. On the other hand, for the specific (i.e. lecture week) perspective, the impact is measured by comparing the result of the intervened group with that of the control group (i.e. a group that is not intervened by PythonTutor) for each data structure material. This evaluation will be conducted based on an unpaired t-test or a Mann-Whitney U test.
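The alternating assignment described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `role` and the `schedule` dictionary are hypothetical and are not part of the study's materials.

```python
def role(group: int, week: int) -> str:
    """Return the role of a group (1 or 2) in a lecture week (1..14)."""
    if group not in (1, 2):
        raise ValueError("group must be 1 or 2")
    if not 1 <= week <= 14:
        raise ValueError("week must be between 1 and 14")
    # Group 2 is intervened in odd weeks, group 1 in even weeks.
    intervened_group = 2 if week % 2 == 1 else 1
    return "intervened" if group == intervened_group else "control"


# Hypothetical overview: week -> (role of group 1, role of group 2)
schedule = {week: (role(1, week), role(2, week)) for week in range(1, 15)}
```

Under this scheme each group is intervened in 7 of the 14 weeks, and in every week exactly one group works without PythonTutor.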

General Overview of Student Laboratory Results
The statistics of student laboratory results for each group in one semester can be seen in Figure 1. The horizontal axis represents lecture weeks while the vertical axis represents the resulting score. Mean refers to the average score of submitted assignments, SD refers to the standard deviation of those scores, and n refers to the number of submitted assignments. All components are based on the assignments submitted by that group in each lecture week. According to Figure 1, two findings can be deduced. First, the students might experience some difficulties when they are asked to handle a big task for the first time. This finding is deduced from the fact that both groups achieved their lowest mean score in the 4th week, the first week of the semester in which the students were asked to handle a big task in a laboratory session. Second, the students might be able to complete a weekly task with a promising result if they have understood the material well. This finding is deduced from the fact that the result of each week varies and some mean scores are higher than 80 out of 100.
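As a rough illustration of the three statistics reported per group and week (Mean, SD, and n), the following sketch computes them from a list of scores. The scores are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which variant was used.

```python
from statistics import mean, stdev


def weekly_summary(scores):
    """Summarize the submitted scores (0..100) of one group in one week."""
    n = len(scores)
    return {
        "mean": mean(scores) if n else None,
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "sd": stdev(scores) if n > 1 else None,
        "n": n,
    }


# Hypothetical scores; only submitted assignments are counted.
summary = weekly_summary([60, 80, 100])
```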

The Results of Evaluating PythonTutor's Impact in General Perspectives
In this evaluation, PythonTutor's impact is measured by comparing the result of intervened sessions with the un-intervened ones for the same group. In other words, for each group, the average result for the odd weeks is compared to the average result for the even weeks. The comparison is based on a paired t-test, and the result for each group can be seen in Figure 2. The horizontal axis represents the evaluated groups while the vertical axis represents the resulting score. The p-value for group 1 is 0.0176, while the p-value for group 2 is 0.0004. In general, both groups show a statistically significant difference when learning data structure materials using PythonTutor, since both p-values are lower than 0.05.
For group 1, the use of PythonTutor is negatively correlated with student performance in terms of student grades. This finding is deduced from the fact that the mean score for the intervened sessions is lower than the score for the un-intervened ones. We would argue that this negative correlation arises because most students in group 1 had never used the tool before they took the BDS course [20]. Therefore, they might feel that the tool hindered them from completing their task, considering that the PythonTutor UI is not intuitive enough for students [20], [21].
For group 2, the use of PythonTutor is positively correlated with student performance in terms of student grades. This finding is deduced from the fact that the mean score for the intervened sessions is higher than the score for the un-intervened ones. We would argue that this positive correlation arises because most students in group 2 had used the tool before they took the BDS course [20]. They might have adapted to the tool, resulting in a positive impact from the use of PythonTutor. This finding is supported by a p-value lower than 0.01.
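The paired comparison used in this section can be sketched with the standard library alone. The sketch assumes one pair of averages per student (intervened weeks versus un-intervened weeks); the function name `paired_t` and the scores are hypothetical, and only the t statistic and degrees of freedom are computed.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(with_tool, without_tool):
    """Return (t statistic, degrees of freedom) for paired samples."""
    if len(with_tool) != len(without_tool):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(with_tool, without_tool)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two pairs are required")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(n)), n - 1


# Hypothetical per-student averages, not data from the study.
t_stat, dof = paired_t([70, 80, 90, 60], [65, 70, 85, 60])
```

A positive t statistic means the intervened average is higher, as reported for group 2, and a negative one means it is lower, as reported for group 1. The p-value then follows from the t distribution with the returned degrees of freedom; a statistics package such as SciPy (`scipy.stats.ttest_rel`) would return it directly.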

The Results of Evaluating PythonTutor's Impact in Lecture Week Perspective
In this evaluation, PythonTutor's impact is measured by comparing the result of the intervened group with that of the control group per lecture week. The comparison is based on unpaired tests. If the scores involved in a comparison are normally distributed, an unpaired t-test is selected as our evaluation metric. Otherwise, the Mann-Whitney U (MWU) test is used. The result of this evaluation can be seen in Table 2. The improvement for each week is obtained by subtracting the mean score of the control group from the mean score of the intervened group. Figure 3 shows the improvement for each lecture week. The horizontal axis refers to lecture weeks while the vertical axis refers to the resulting degree of improvement. Out of 14 weeks, only 2 weeks (bolded in Table 2) show that PythonTutor affects student performance significantly (p-value < 0.05). These are the 1st and 12th weeks.
On the one hand, the 1st week shows a statistically significant mean reduction of 2.44. Hence, it can be stated that the use of PythonTutor negatively affects student performance in completing the laboratory task in that week. This finding contradicts the result reported in [21], which stated that most students from group 2 (the group which acted as the intervened group in the 1st week) felt that the use of PythonTutor had the greatest effect in that week. On closer inspection, such a contradiction is natural, considering that student perspective and the actual result might not always be in sync with each other.
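The weekly improvement and the Mann-Whitney U statistic can be sketched as follows. The function names and scores are hypothetical, the normality check that selects between the two tests is not shown, and counting ties as 0.5 is a common convention rather than something stated in the text.

```python
from statistics import mean


def improvement(intervened, control):
    """Mean score of the intervened group minus that of the control group."""
    return mean(intervened) - mean(control)


def mann_whitney_u(x, y):
    """Return (U_x, U_y): U_x counts pairs where x beats y, ties count 0.5."""
    u_x = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in x
        for b in y
    )
    return u_x, len(x) * len(y) - u_x


# Hypothetical scores for one lecture week, not data from the study.
intervened_scores = [90, 85, 70]
control_scores = [80, 75, 70, 60]
delta = improvement(intervened_scores, control_scores)
u_intervened, u_control = mann_whitney_u(intervened_scores, control_scores)
```

A positive `delta` corresponds to a mean improvement and a negative one to a mean reduction. Turning the U statistic into a p-value needs its reference distribution, which a statistics package such as SciPy (`scipy.stats.mannwhitneyu`) provides.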
On the other hand, the 12th week shows a statistically significant mean improvement of 8.86. Hence, it can be stated that the use of PythonTutor positively affects student performance in completing the laboratory task in that week. On closer inspection, this finding is natural since a double-pointer linked list is a simple expansion of the standard linked list, a data structure that had been learned in several previous weeks. We would argue that, at that time, the students from group 1 (the group which acted as the intervened group in the 12th week) had adapted to the visual representation of the standard linked list in PythonTutor, resulting in a significant improvement in terms of student grades.
For the remaining 12 weeks, the correlation between the use of PythonTutor and student performance is not statistically significant. On closer inspection, this finding might be caused by the high variance of student characteristics between the two groups. Such variance might include student background, learning style, intelligence, and prior knowledge.
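To illustrate the remark that a double-pointer linked list is a simple expansion of the standard linked list, the sketch below adds a single extra pointer to a standard node. It assumes that "double-pointer" means each node keeps both a next and a previous pointer; the class and function names are hypothetical and are not taken from the course material.

```python
class Node:
    """Node of a standard (singly) linked list."""

    def __init__(self, value):
        self.value = value
        self.next = None


class DoubleNode(Node):
    """The same node with one additional pointer to the previous node."""

    def __init__(self, value):
        super().__init__(value)
        self.prev = None


def append(tail, node):
    """Link node after tail and return the new tail."""
    tail.next = node
    if isinstance(node, DoubleNode):
        node.prev = tail  # the only extra step compared with a standard list
    return node


head = DoubleNode(1)
tail = append(head, DoubleNode(2))
tail = append(tail, DoubleNode(3))

# Forward traversal works exactly as in a standard linked list.
forward = []
current = head
while current is not None:
    forward.append(current.value)
    current = current.next

# Backward traversal is what the extra pointer makes possible.
backward = []
current = tail
while current is not None:
    backward.append(current.value)
    current = current.prev
```

Stepping through such code in a program visualization tool shows the same boxes and arrows as for the standard list, plus one extra arrow per node.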

Conclusion and Future Work
This paper presents a quasi-experimental design to evaluate the impact of PythonTutor on learning programming in 14-lecture-week laboratory sessions. The sessions were held in the even semester of the 2016/2017 academic year and involved 4 classes of the Basic Data Structure (BDS) course. According to our evaluation, several findings can be deduced: 1) PythonTutor might positively affect student performance when the students have used the tool before; 2) student perspective is not always in sync with the actual laboratory result; 3) the impact of PythonTutor might be improved when a similar data representation is used consecutively for several weeks; 4) the correlation between the use of PythonTutor and student performance might not be significant when the control and intervened groups have completely different characteristics; 5) the students might experience some difficulties when they are asked to handle a big task for the first time; and 6) the students might be able to complete a particular weekly task with a promising result if they have understood the material well.
For future work, we plan to develop a PV tool that, to some extent, is similar to PythonTutor but with more comprehensive features. These features are expected to address the negative feedback reported in Karnalim & Ayub's work and in the results of this work. Hopefully, such a tool may help students to learn programming, especially in our university.