Abstract:Academics and departments are sometimes judged by how their research has benefitted society. For example, the UK Research Excellence Framework (REF) assesses Impact Case Studies (ICS), which are five-page evidence-based claims of societal impacts. This study investigates whether ChatGPT can evaluate societal impact claims and therefore potentially support expert human assessors. For this, various parts of 6,220 public ICS from REF2021 were fed to ChatGPT 4o-mini along with the REF2021 evaluation guidelines, and the results were compared with published departmental average ICS scores. The results suggest that the optimal strategy for high correlations with expert scores is to input the title and summary of an ICS but not the remaining text, and to modify the original REF guidelines to encourage a stricter evaluation. The scores generated by this approach correlated positively with departmental average scores in all 34 Units of Assessment (UoAs), with values between 0.18 (Economics and Econometrics) and 0.56 (Psychology, Psychiatry and Neuroscience). At the departmental level, the corresponding correlations were higher, reaching 0.71 for Sport and Exercise Sciences, Leisure and Tourism. Thus, ChatGPT-based ICS evaluations are a simple and viable way to support or cross-check expert judgments, although their value varies substantially between fields.
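In outline, the scoring pipeline described in this abstract could be reproduced with the OpenAI chat API. The sketch below is a minimal illustration only: the prompt wording, the score-parsing logic, and the input variables (guidelines, ics_records, dept_avg_scores) are assumptions for illustration, not the study's exact procedure.

# Minimal sketch: score each ICS from its title and summary with ChatGPT 4o-mini,
# then correlate the generated scores with published departmental averages.
# Prompt text, field names and parsing are hypothetical, not the authors' setup.
import re
from openai import OpenAI
from scipy.stats import spearmanr

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def score_ics(title: str, summary: str, guidelines: str) -> float:
    """Ask the model for a single 1-4 star quality rating for one Impact Case Study."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": guidelines},  # modified, stricter REF2021 guidelines
            {"role": "user", "content": f"Title: {title}\n\nSummary: {summary}\n\n"
                                        "Give a single overall star rating from 1 to 4."},
        ],
    )
    text = response.choices[0].message.content
    match = re.search(r"[1-4](\.\d+)?", text)  # pull the first numeric rating from the reply
    return float(match.group()) if match else float("nan")

# ics_records: one dict per ICS with "title" and "summary" keys (hypothetical).
# dept_avg_scores: the published departmental average ICS score for the department
# that submitted each ICS, in the same order, since individual ICS scores are unpublished.
predicted = [score_ics(r["title"], r["summary"], guidelines) for r in ics_records]
rho, p = spearmanr(predicted, dept_avg_scores)
print(f"Correlation with departmental averages: {rho:.2f} (p={p:.3f})")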
Abstract:National research evaluation initiatives and incentive schemes have previously chosen between simplistic quantitative indicators and time-consuming peer review, sometimes supported by bibliometrics. Here we assess whether artificial intelligence (AI) could provide a third alternative, estimating article quality from multiple bibliometric and metadata inputs. We investigated this using provisional three-level REF2021 peer review scores for 84,966 articles submitted to the UK Research Excellence Framework 2021 that matched a Scopus record from 2014-18 and had a substantial abstract. We found that accuracy is highest in the medical and physical sciences Units of Assessment (UoAs) and economics, reaching 42% above the baseline (72% overall) in the best case. This is based on 1000 bibliometric inputs and half of the articles used for training in each UoA. Prediction accuracies above the baseline for the social science, mathematics, engineering, arts, and humanities UoAs were much lower or close to zero. The Random Forest Classifier (standard or ordinal) and Extreme Gradient Boosting Classifier algorithms performed best of the 32 tested. Accuracy was lower if UoAs were merged or replaced by Scopus broad categories. We increased accuracy with an active learning strategy and by selecting articles with higher prediction probabilities, as estimated by the algorithms, but this substantially reduced the number of scores predicted.
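As a rough illustration of the best-performing setup reported above, the sketch below trains a Random Forest on bibliometric inputs for a single UoA, using half of the articles for training, and then keeps only predictions whose estimated class probability exceeds a threshold. The feature matrix X, the three-level score vector y, and the 0.8 threshold are illustrative assumptions, not the study's exact configuration.

# Sketch: predict three-level REF scores from bibliometric/metadata features with
# a Random Forest, then restrict to high-confidence predictions. X and y are
# hypothetical inputs for one UoA; 0.8 is an illustrative probability cut-off.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=0)

model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)           # class probabilities per article
pred = model.classes_[np.argmax(proba, axis=1)]
confident = np.max(proba, axis=1) >= 0.8      # keep only high-probability predictions

print("Accuracy on all test articles:", accuracy_score(y_test, pred))
print("Accuracy on confident subset:",
      accuracy_score(np.asarray(y_test)[confident], pred[confident]),
      "covering", confident.mean(), "of test articles")

Restricting output to the confident subset mirrors the trade-off noted in the abstract: accuracy rises, but the proportion of articles receiving a predicted score falls.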
Abstract:This document describes strategies for using Artificial Intelligence (AI) to predict some journal article scores in future research assessment exercises. Five strategies have been assessed.
Abstract:This literature review identifies indicators that associate with higher impact or higher quality research from article text (e.g., titles, abstracts, lengths, cited references and readability) or metadata (e.g., the number of authors, international or domestic collaborations, journal impact factors and authors' h-index). This includes studies that used machine learning techniques to predict citation counts or quality scores for journal articles or conference papers. The literature review also includes evidence about the strength of association between bibliometric indicators and quality score rankings from previous UK Research Assessment Exercises (RAEs) and REFs in different subjects and years and similar evidence from other countries (e.g., Australia and Italy). In support of this, the document also surveys studies that used public datasets of citations, social media indicators or open review texts (e.g., Dimensions, OpenCitations, Altmetric.com and Publons) to help predict the scholarly impact of articles. The results of this part of the literature review were used to inform the experiments using machine learning to predict REF journal article quality scores, as reported in the AI experiments report for this project. The literature review also covers technology to automate editorial processes, to provide quality control for papers and reviewers' suggestions, to match reviewers with articles, and to automatically categorise journal articles into fields. Bias and transparency in technology-assisted assessment are also discussed.
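To make the kinds of indicators surveyed in this review concrete, the sketch below assembles a per-article feature vector from text and metadata fields of the type listed above (title length, abstract length, a crude readability proxy, author count, cited reference count, collaboration type, journal impact factor). The dictionary keys and the average-sentence-length readability proxy are illustrative assumptions; the review does not prescribe a particular implementation.

# Sketch: build a feature vector from text and metadata indicators of the kind
# surveyed in the review. Input keys and the readability proxy are hypothetical.
import re

def article_features(article: dict) -> dict:
    abstract = article["abstract"]
    sentences = [s for s in re.split(r"[.!?]+", abstract) if s.strip()]
    words = abstract.split()
    return {
        "title_length": len(article["title"].split()),
        "abstract_length": len(words),
        "avg_sentence_length": len(words) / max(len(sentences), 1),  # crude readability proxy
        "n_authors": len(article["authors"]),
        "n_references": len(article["references"]),
        "international_collab": int(len(set(article["author_countries"])) > 1),
        "journal_impact_factor": article["journal_impact_factor"],
    }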