A Scalable Code Similarity Detection with Online Architecture and Focused Comparison for Maintaining Academic Integrity in Programming

Many code similarity detection techniques have been developed to maintain academic integrity in programming. However, most of them assume that the student programs are locally available and that the computation can be run on any computer specification. Further, their comparison for raising suspicion is time-consuming as the student programs are pairwise compared to one another. This paper proposes a scalable code similarity detection with online architecture and focused comparison. The former enables student programs to be shared among lecturers and guarantees that the computation is runnable. The latter shortens the execution time as only some students are considered, with inclusion criteria determined by the lecturers. To boost scalability, cosine correlation, whose computation takes linear time, is used as the similarity algorithm. Our evaluation shows that focused comparison leads to fewer comparisons and cosine correlation leads to shorter execution time.


Introduction
Maintaining academic integrity is a serious concern in engineering education [1], [2], especially with the introduction of MOOCs [3], [4]. Several strategies have been proposed, one of the most popular being the use of Turnitin [5]. However, only a few of them are applicable to programming courses [6], even though these courses are common in many engineering curricula. A possible reason behind this is the difference between standard text and source code [6].
In general, strategies for maintaining academic integrity in programming can be classified into five categories [7]. Educating the students about that kind of integrity is probably the most obvious one. This is usually carried out at the beginning of the course, with a lecturer or tutor explaining the acceptable practices [8]. Cheating can also be mitigated by discouraging such behavior (e.g., incorporating additional assessment measures [2]) or by reducing the benefits of cheating (e.g., lowering the score of each assessment, making it not worth cheating).
Our proposed detection is featured with online architecture and focused comparison. With the online architecture, student programs can be shared among lecturers, and computation can be performed regardless of the lecturer's personal computer specification since that computation is carried out by the server. Focused comparison can shorten the execution time as not all student programs are compared, with the inclusion criteria defined by the lecturers. For scalability, cosine correlation from information retrieval is used to measure the similarities.

Methodology
Using our detection involves four stages: student program collection, perpetrator candidate selection, plagiarism detection, and in-depth discussion. It accepts either Java or Python student programs as input.
Student program collection means that all participating lecturers should upload their own student programs to the server. At this stage, student programs from previous courses can also be uploaded if needed. For weekly assessments with different class schedules, an agreement can be made among lecturers to upload the student programs no later than a particular day.
Perpetrator candidate selection is performed manually by each lecturer. Per class, a set of students is selected based on the lecturer's suspicion. These students can be those who lack programming skill, seldom attend classes, or have previously breached academic integrity. For objectivity, such criteria can be discussed among participating lecturers at the beginning of the course.
Plagiarism detection is carried out separately per class. The perpetrator candidates' programs are given as queries to the detection technique, and per query, any similar student programs are retrieved in descending order of similarity degree. Fig. 1 shows our detection technique's layout for this stage, in which 'selected student programs' are the perpetrator candidates' (selected via the search box above) and 'search result' lists any similar student programs for a particular query (selected by clicking that query in 'selected student programs'). To avoid information overload, only five search results are given per query.
The search result per query is determined by comparing the query to all student programs uploaded to the server (except the query itself). Compared to other detection techniques that pairwise compare all possible combinations, this is more time-efficient as the number of comparisons grows linearly with the number of student programs.
The comparison itself (referred to as CosineTS) is performed in two steps. First, the student programs are converted to token strings with the help of ANTLR [24], with comments and whitespace removed as they are easy to disguise. Then, the query's string is compared to each student program's string with cosine correlation, a similarity measure adapted from information retrieval [16]. Compared to the string-matching algorithms used in many detection techniques, this is also more time-efficient thanks to its linear time complexity [17]. Suspected student pairs are formed by pairing each query with one of its search results.
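As a rough illustration of CosineTS's two steps, the sketch below substitutes a simple regular-expression tokeniser for the ANTLR lexer (the `tokenize` helper is our simplification, not part of the technique) and computes cosine correlation over token-frequency vectors, which takes linear time in the number of tokens:

```python
import math
import re
from collections import Counter

def tokenize(source: str) -> list[str]:
    """Stand-in for the ANTLR lexer: drops comments and whitespace,
    then splits the code into identifier, number, and symbol tokens."""
    source = re.sub(r"#.*", "", source)  # remove Python-style comments
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source)

def cosine_similarity(tokens_a: list[str], tokens_b: list[str]) -> float:
    """Cosine correlation of two token-frequency vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

query = "total = 0\nfor x in data:\n    total += x"
other = "s = 0\nfor v in values:\n    s += v  # accumulate"
print(cosine_similarity(tokenize(query), tokenize(other)))
```

Programs that share operators and structure but differ in identifier names still score well below 1.0 here, which is why the syntax-tree-based modes described next are more resistant to renaming.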
In addition to CosineTS, three other comparison modes are provided. RKRGST converts the student programs to token strings and then measures the similarities by running Karp-Rabin greedy string tiling [25], a common string-matching algorithm for code similarity detection techniques [32]. Structure works similarly, but the token strings are the result of linearising the syntax trees in a pre-order manner, inspired by two former studies [26], [27]. These tokens are expected to be more resistant to surface modification as most of them cannot be modified directly at the source code level.
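The pre-order linearisation used by Structure can be sketched as follows. This is a minimal illustration using Python's built-in `ast` module in place of ANTLR; emitting only node-type names (and dropping identifiers and literal values) is what makes the token string resistant to surface modification:

```python
import ast

def linearise(node: ast.AST) -> list[str]:
    """Pre-order traversal of the syntax tree, emitting node-type names.
    Identifiers and literal values are deliberately dropped, so renaming
    variables does not change the resulting token string."""
    tokens = [type(node).__name__]
    for child in ast.iter_child_nodes(node):
        tokens.extend(linearise(child))
    return tokens

original = "total = 0\nfor x in data:\n    total += x"
renamed = "s = 0\nfor v in values:\n    s += v"
# Renaming identifiers leaves the syntax tree tokens unchanged:
print(linearise(ast.parse(original)) == linearise(ast.parse(renamed)))  # True
```

Note that the linearised string is typically longer than the regular token string, since every internal tree node contributes a token; this matters for the execution-time comparison later in the paper.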
CosineAST is similar to Structure except that the similarity measure is cosine correlation instead of running Karp-Rabin greedy string tiling. It is actually a simplified version of the technique proposed in [17], and is expected to be less time-consuming.
Several studies [28], [29] state that high similarity does not necessarily entail plagiarism. Hence, there is a need to revalidate the similarities in the suspected pairs and check whether they are likely to result from plagiarism [30]. Our detection technique supports this investigation by showing the code content of each pair's student programs side by side, as seen in Fig. 1. For convenience, similar fragments are highlighted in green. This layout is remodeled from JPlag [10] with the help of Plago [31]. If Structure or CosineAST is used as the similarity algorithm, the layout includes syntax tree tokens and shows the code contents as two lists of tokens (see Fig. 2). Similar to the standard layout, the similarities are highlighted in green.
After the suspected pairs of each class have been revalidated, an in-depth discussion should be conducted to ensure that no independent programs are listed among the suspected pairs, as students may feel discouraged if they are wrongly accused. It is advised that the discussion involves former lecturers or tutors of the suspected students. If the work seems to be copied from another class or course, the lecturer of that class or course should also be invited. At the end of this stage, the suspected students have been identified and will be penalized according to the course's policy.

Evaluation and Discussion
This section evaluates the impact of focused comparison, the impact of cosine correlation and the impact of our proposed comparison modes (CosineTS, RKRGST, CosineAST, and Structure).

The impact of focused comparison
Focused comparison is expected to be more time-efficient as not all student programs are compared. To prove this, it was compared with the naïve comparison (which exhaustively includes all possible comparison pairs) for numbers of student programs ranging from 0 to 100 in steps of 10. The focused comparison used 10% of the student programs as queries. Fig. 3 shows that focused comparison results in fewer comparison pairs than the naïve one, and the difference becomes more salient when many student programs are involved. This is expected: under focused comparison the number of comparison pairs grows linearly with the number of student programs, whereas under naïve comparison it grows quadratically.
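The pair counts behind this comparison can be sketched as follows. This is our own illustration (the helper names `naive_pairs` and `focused_pairs` are not from the paper), assuming each of the q queries is compared against the other n − 1 programs:

```python
def naive_pairs(n: int) -> int:
    """Exhaustive pairwise comparison: every unordered pair of n programs."""
    return n * (n - 1) // 2

def focused_pairs(n: int, q: int) -> int:
    """Focused comparison: each of the q queries is compared
    against the other n - 1 programs."""
    return q * (n - 1)

for n in range(0, 101, 10):
    q = n // 10  # 10% of the student programs act as queries
    excluded = naive_pairs(n) - focused_pairs(n, q)
    print(f"n={n:3d}  naive={naive_pairs(n):4d}  "
          f"focused={focused_pairs(n, q):4d}  excluded={excluded:4d}")
```

Under these assumptions, 100 programs with 10 queries yield 990 focused pairs against 4950 naïve pairs, roughly the 80% reduction reported in the evaluation.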
Taking the most extreme scenario with 100 student programs, focused comparison can exclude 3950 comparison pairs, a 79.8% reduction. If each comparison pair takes one second, this saves about one hour of execution time.
We are aware that when the number of student programs is low (e.g., 10), the difference becomes harder to see. However, focused comparison still leads to fewer comparison pairs than the naïve one, except when no student programs are considered.

The impact of cosine correlation
Theoretically, cosine correlation is faster than running Karp-Rabin greedy string tiling [25], a common string-matching algorithm for this task [32], as the former takes linear computation time while the latter takes quadratic time. This subsection evaluates how much time is saved by replacing the latter with the former.
The evaluation involves the two token representations used in our comparison modes: regular and syntax tree token strings. A regular token string results from tokenising the source code directly with ANTLR and is used by CosineTS and RKRGST. A syntax tree token string results from linearising the syntax tree in a pre-order manner and is used by CosineAST and Structure. Each representation has one mode with cosine correlation (CosineTS or CosineAST) and another with running Karp-Rabin greedy string tiling (RKRGST or Structure).
The reduced execution time was measured in two steps. First, the execution time of each mode was measured by searching for the copied programs of a student program in two sets of introductory programming assessments (2426 Python files in total). Then, the time difference between the two modes was calculated and normalised to the execution time of the string-matching mode (RKRGST or Structure). Fig. 4 shows that replacing running Karp-Rabin greedy string tiling with cosine correlation results in shorter execution time: a reduction of about 11% for the regular token string and 38.9% for the syntax tree one. The time reduction for the syntax tree token string is larger since its strings have more tokens as a result of linearising the syntax trees, and a larger number of tokens leads to longer execution time for the string-matching algorithm due to its quadratic computation.

The impact of comparison modes
Four comparison modes (CosineTS, RKRGST, CosineAST, and Structure) are proposed for the detection. This subsection evaluates their impact with two evaluation metrics: f-score and execution time. The former covers effectiveness while the latter covers efficiency. This is expected to provide a brief summary of the characteristics of the proposed modes.
F-score is often used to measure effectiveness in general, where a higher value is preferred. It is the harmonic mean of precision and recall, calculated as in (1). Precision is the proportion of copied student programs among the suspected results; its equation can be seen in (2), and it results from dividing the number of true positives by the sum of true and false positives. Recall is the proportion of suspected results among the copied student programs; its equation can be seen in (3), and it results from dividing the number of true positives by the sum of true positives and false negatives.
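Based on the descriptions above, equations (1)-(3) can be reconstructed as the standard formulas (where TP, FP, and FN denote true positives, false positives, and false negatives respectively):

```latex
F\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall} \quad (1)

Precision = \frac{TP}{TP + FP} \quad (2)

Recall = \frac{TP}{TP + FN} \quad (3)
```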
For measuring f-score, the Java introductory programming data set from [32] was used. The data set covers seven introductory programming materials: output, input, branching, looping, array, method, and matrix. Copied programs are mapped to the six plagiarism levels defined by [13]: comment and whitespace modification, identifier renaming, component declaration relocation, method structure change, program statement replacement, and logic change. They are referred to as level-1 to level-6 respectively, where a higher level is more difficult to apply and less frequently found in real cases. In total, the data set contains 7 original student programs (the queries), 105 independent student programs, and 355 copied student programs derived from the originals. Each query has 20 independent student programs and up to 36 copied student programs (up to 9 programs per plagiarism level).
Execution time (in seconds) was recorded in the same way as for measuring the impact of cosine correlation: it is the amount of time required to search for the copied programs of a student program among 2426 Python files. A lower value is preferred for this metric, as faster execution means higher scalability.
Fig. 5 shows that all four comparison modes are equally effective for the first plagiarism level. This is expected as that level focuses on modifying comments and whitespace, two components that are ignored by all modes. On level-2 (identifier renaming), Structure becomes the most effective as the modification does not change token order and its impact can be mitigated by considering syntax tokens. This mode, however, becomes only as effective as RKRGST on level-3 (component declaration relocation); relocating declaration statements does not change the syntax tokens, leading to no improvement from having those tokens on board.
For the remaining levels (which are about method structure change, program statement replacement, and logic change), Structure becomes the least effective as the modification affects syntax tokens, enlarging the number of mismatches. CosineTS, which is the least effective on the first three levels, gradually experiences effectiveness improvement, making it the second highest on level-6 (logic change).
In terms of efficiency (see Fig. 6), CosineTS is the most efficient one, taking only about 89 seconds to process 2426 comparisons. It is followed by RKRGST, which takes more time to calculate the similarities; its algorithm has quadratic complexity while CosineTS's is linear.
CosineAST and Structure are slower than the first two as syntax trees must be generated and linearised prior to comparison. The time required for tree generation can be longer if the code is complex [17]. Structure is the slowest one due to the combination of a quadratic similarity algorithm and tree generation.
To sum up, Structure is exclusively beneficial for dealing with modifications related to identifier renaming, while RKRGST is the most effective one for the remaining levels. For scalability, CosineTS is the most preferred due to its fast computation, followed by RKRGST, CosineAST, and Structure.
CosineTS is advised if many student programs are considered. However, if only a few of them are involved, RKRGST can be used for higher effectiveness. CosineAST can replace CosineTS if students tend to disguise their programs with identifier renaming; similarly, Structure can replace RKRGST for the same purpose.

Conclusion and Future Work
A scalable code similarity detection for maintaining academic integrity in programming is proposed in this paper. It is uniquely featured with online architecture and focused comparison. The former facilitates student program sharing among lecturers and assures that the computation can be performed regardless of the lecturers' personal computer specifications. The latter can shorten the execution time as only some student programs are considered. To enhance scalability, the similarity measure is cosine correlation, an algorithm with linear time complexity.
According to our evaluation, focused comparison can exclude many comparison pairs when many student programs are involved. With 100 student programs on board, it can exclude 79.8% of the comparison pairs. This obviously leads to shorter execution time, as time is proportional to the number of comparisons.
Replacing running Karp-Rabin greedy string tiling with cosine correlation can also shorten the execution time. The benefit becomes larger when the token strings are longer (such as those resulting from linearised syntax trees).
Our detection technique features four comparison modes: CosineTS, RKRGST, CosineAST, and Structure. Among those, CosineTS is the most scalable while RKRGST is the most effective for most plagiarism levels. The other two modes can be used for small sets of student programs whose modifications are mainly about renaming identifiers.
Our detection technique is considerably scalable as it can search for copied programs among 2426 student programs in less than one and a half minutes. It is also effective in dealing with superficial modifications that are commonly found in programming assessments (the first three plagiarism levels).
For future work, we plan to use the detection technique in some programming courses and summarise the experiences. This is expected to enrich our current findings from the user perspective. In addition, we also plan to evaluate the comparison modes with other metrics to gain a deeper understanding of their characteristics.

Mewati Ayub graduated with a Bachelor of Informatics from Bandung Institute of Technology (ITB) in 1986, and completed her Master's degree there in 1996 and her doctoral degree in 2006. She has been working as a faculty member in the Faculty of Information Technology at Maranatha Christian University since 2006. Her specialties are computer science education, software engineering, and data analytics.