AI-Enhanced Auto-Correction of Programming Exercises: How Effective is GPT-3.5?

Imen Azaiz; Oliver Deckarm; Sven Strickroth

doi:10.3991/ijep.v13i8.45621

Authors

Imen Azaiz LMU Munich https://orcid.org/0009-0005-6458-4169
Oliver Deckarm LMU Munich https://orcid.org/0009-0008-2919-4044
Sven Strickroth LMU Munich https://orcid.org/0000-0002-9647-300X

Keywords:

E-Assessment, Personalized Feedback, GPT-3.5, Large Language Model, Programming Education, Formative Assessment

Abstract

Timely formative feedback is considered one of the most important drivers of effective learning. Delivering timely and individualized feedback is particularly challenging in large classes in higher education. Recently, large language models such as GPT-3 became publicly available and showed promising results on various tasks such as code generation and code explanation. This paper investigates the potential of AI in providing personalized code correction and generating feedback. Based on existing student submissions for two different real-world assignments, it examines the correctness of the AI-aided e-assessment as well as characteristics of the generated feedback, such as fault localization, correctness of hints, and code style suggestions. The results show that 73% of the submissions were correctly identified as either correct or incorrect. In 59% of these cases, GPT-3.5 also successfully generated effective and high-quality feedback. However, GPT-3.5 also exhibited weaknesses in its evaluation, including localizing errors that were not the actual errors and even hallucinating errors. Implications and potential new usage scenarios are discussed.
