Evaluating AI and human authorship quality in academic writing through physics essays

Yeadon, Will; Agra, Elise; Inyang, Oto-Obong; Mackay, Paul; Mizouri, Arin

doi:10.1088/1361-6404/ad669d

Authors

Dr Will Yeadon will.yeadon@durham.ac.uk
Assistant Professor

Dr Elise Agra elise.s.agra@durham.ac.uk
Career Development Fellow

Dr Oto-Obong Inyang o.o.a.inyang@durham.ac.uk
Assistant Professor

Dr Paul Mackay paul.t.mackay@durham.ac.uk
Career Development Fellow

Dr Arin Mizouri arin.mizouri@durham.ac.uk
Assistant Professor

Abstract

This study compares the academic writing quality and detectability of authorship of human and AI-generated texts by evaluating n = 300 short-form physics essay submissions, equally divided between student work submitted before the introduction of ChatGPT and essays generated by OpenAI’s GPT-4. In blinded evaluations conducted by five independent markers who were unaware of the origin of the essays, we observed no statistically significant differences in scores between essays authored by humans and those produced by AI (p-value = 0.107, α = 0.05). Additionally, when the markers subsequently attempted to identify the authorship of the essays on a 4-point Likert scale, from ‘Definitely AI’ to ‘Definitely Human’, their performance was only marginally better than random chance. This outcome not only underscores the convergence of AI and human authorship quality but also highlights the difficulty of discerning AI-generated content solely through human judgment. Furthermore, the effectiveness of five commercially available software tools for identifying essay authorship was evaluated. Among these, ZeroGPT was the most accurate, achieving a 98% accuracy rate and a precision score of 1.0 when its classifications were reduced to binary outcomes. This result is a source of potential optimism for maintaining assessment integrity. Finally, we propose that 50% AI-generated content should be considered the upper limit for classifying a text as human-authored, a boundary inclusive of a future with ubiquitous AI assistance whilst also respecting human authorship.
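For readers unfamiliar with the detector metrics quoted in the abstract, the following minimal sketch shows how accuracy and precision follow from a binary confusion matrix. The counts are hypothetical: they are chosen only to be arithmetically consistent with the reported 98% accuracy and precision of 1.0, on the assumptions that all 300 essays were scored by the tool and that "AI" is the positive class. They are not the paper's actual confusion matrix, and the function name `binary_metrics` is illustrative.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision) for a binary classifier.

    tp: AI essays flagged as AI        fp: human essays flagged as AI
    tn: human essays flagged as human  fn: AI essays flagged as human
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Precision is undefined when nothing is flagged as positive.
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return accuracy, precision


# Hypothetical counts (assumption, not data from the paper):
# 150 AI essays and 150 human essays, with no human essay flagged as AI.
accuracy, precision = binary_metrics(tp=144, fp=0, tn=150, fn=6)
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}")
```

A precision of 1.0 means that, under these assumptions, no human-written essay is misclassified as AI-generated; any errors would fall entirely on AI essays that pass as human.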

Citation

Yeadon, W., Agra, E., Inyang, O.-O., Mackay, P., & Mizouri, A. (2024). Evaluating AI and human authorship quality in academic writing through physics essays. European Journal of Physics, 45(5), Article 055703. https://doi.org/10.1088/1361-6404/ad669d

Journal Article Type	Article
Acceptance Date	Jul 23, 2024
Online Publication Date	Sep 2, 2024
Publication Date	Sep 1, 2024
Deposit Date	Sep 13, 2024
Publicly Available Date	Sep 13, 2024
Journal	European Journal of Physics
Print ISSN	0143-0807
Electronic ISSN	1361-6404
Publisher	IOP Publishing
Peer Reviewed	Peer Reviewed
Volume	45
Issue	5
Article Number	055703
DOI	https://doi.org/10.1088/1361-6404/ad669d
Keywords	benchmark, ChatGPT, AI, academic writing
Public URL	https://durham-repository.worktribe.com/output/2800164