Examining the Performance of Artificial Intelligence in Scoring Students' Handwritten Responses to Open-Ended Items

Mahmut Sami Yiğiter, Erdem Boduroğlu

Abstract

Open-ended items, which have been used for centuries to evaluate student achievement, have many advantages, such as measuring higher-order skills, providing rich diagnostic information about the student, and eliminating success through guessing. However, open-ended items are rarely used today in exams with large numbers of students because of the potential for errors in the scoring process and the disadvantages in terms of labour, time, and cost. At this point, Artificial Intelligence (AI) offers considerable potential for scoring open-ended items. The aim of this study is to examine the performance of AI in scoring students' handwritten responses to open-ended items. In the study, an achievement test consisting of 3 open-ended and 10 multiple-choice items was developed within the scope of the Measurement and Assessment in Education course at a state university. Open-ended items were scored on a structured 0-1-2 scale, while multiple-choice items were scored dichotomously (0-1). A total of 84 participants took part in the study, and the open-ended items were scored both by an expert group and by the AI tool (ChatGPT-4o). Images of the students' handwritten responses were scored by the AI tool under two different scenarios. In the first scenario, the AI tool was asked to score without being given any scoring criteria, whereas in the second scenario, it was asked to score according to standard scoring criteria. The findings showed low agreement and correlation coefficients between the AI scores assigned without criteria and the expert scores, and high agreement and correlation coefficients between the AI scores assigned with standard scoring criteria and the expert scores. Consistent with these findings, item discrimination was quite low for the AI scores assigned without criteria and high for the AI scores assigned with standard scoring criteria. The reasons for discrepancies between the expert scores and the AI scores assigned with standard criteria were also investigated and reported. The results show that, given standardized scoring criteria, AI can score handwritten responses to open-ended items at a good level. As AI continues to develop, it is expected to reach scoring consistency comparable to that of expert raters.
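
The abstract reports agreement and correlation coefficients between expert and AI scores without specifying the exact indices or data. The following is a minimal illustrative sketch, not the authors' analysis code, assuming hypothetical 0-1-2 scores and using quadratic-weighted kappa and Pearson correlation as example rater-agreement statistics.

```python
# Illustrative sketch of comparing expert and AI scores for one open-ended item.
# The score vectors below are hypothetical; the study's data are not reproduced here.
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

expert = [2, 1, 0, 2, 2, 1, 0, 1, 2, 0]            # expert-group scores (0-1-2)
ai_no_criteria = [1, 2, 1, 1, 2, 2, 1, 0, 1, 1]    # AI scores without criteria
ai_with_criteria = [2, 1, 0, 2, 1, 1, 0, 1, 2, 0]  # AI scores with standard criteria

for label, ai in [("AI without criteria", ai_no_criteria),
                  ("AI with standard criteria", ai_with_criteria)]:
    # Quadratic-weighted kappa is one common agreement index for ordinal scores;
    # the abstract does not state which coefficient the authors actually used.
    kappa = cohen_kappa_score(expert, ai, weights="quadratic")
    r, _ = pearsonr(expert, ai)
    print(f"{label}: weighted kappa = {kappa:.2f}, Pearson r = {r:.2f}")
```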

Keywords

Open-ended item, Artificial intelligence, AI, ChatGPT, Automated scoring, Handwritten responses, Constructed response item


DOI: http://dx.doi.org/10.15390/EB.2025.14119

This work is licensed under a Creative Commons Attribution 4.0 License.