Can we evaluate the quality of software engineering experiments?

Kitchenham, Barbara; Sjøberg, Dag I.K.; Brereton, O. Pearl; Budgen, David; Dybå, Tore; Höst, Martin; Pfahl, Dietmar; Runeson, Per

doi:10.1145/1852786.1852789

Authors

Barbara Kitchenham

Dag I.K. Sjøberg

O. Pearl Brereton

David Budgen david.budgen@durham.ac.uk
Emeritus Professor

Tore Dybå

Martin Höst

Dietmar Pfahl

Per Runeson

Abstract

Context: The authors wanted to assess whether the quality of published human-centric software engineering experiments was improving. This required a reliable means of assessing the quality of such experiments.

Aims: The aims of the study were to confirm the usability of a quality evaluation checklist, determine how many reviewers were needed per paper that reports an experiment, and specify an appropriate process for evaluating quality.

Method: With eight reviewers and four papers describing human-centric software engineering experiments, we used a quality checklist with nine questions. We conducted the study in two parts: the first was based on individual assessments and the second on collaborative evaluations.

Results: The inter-rater reliability was poor for individual assessments but much better for joint evaluations. Four reviewers working in two pairs with discussion were more reliable than eight reviewers with no discussion. The sum of the nine criteria was more reliable than individual questions or a simple overall assessment.

Conclusions: If quality evaluation is critical, more than two reviewers are required and a round of discussion is necessary. We advise using quality criteria and basing the final assessment on the sum of the aggregated criteria. The restricted number of papers used and the relatively extensive expertise of the reviewers limit our results. In addition, the results of the second part of the study could have been affected by removing a time restriction on the review as well as the consultation process. © 2010 ACM.

Citation

Kitchenham, B., Sjøberg, D. I., Brereton, O. P., Budgen, D., Dybå, T., Höst, M., Pfahl, D., & Runeson, P. (2010, September). Can we evaluate the quality of software engineering experiments? Presented at ESEM 2010 - 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, Bolzano-Bozen, Italy.

Presentation Conference Type	Conference Paper (published)
Conference Name	ESEM 2010 - 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement
Start Date	Sep 16, 2010
End Date	Sep 17, 2010
Online Publication Date	Sep 16, 2010
Publication Date	Nov 12, 2010
Deposit Date	Feb 23, 2025
Peer Reviewed	Peer Reviewed
DOI	https://doi.org/10.1145/1852786.1852789
Public URL	https://durham-repository.worktribe.com/output/3500729

