Replacing the REF assessment of research outputs by a process of collecting peer-provided ratings

Posted on December 10, 2015

Several countries periodically assess the performance of their research institutions, typically to allocate institutional funding. The UK’s Research Excellence Framework (REF), formerly named Research Assessment Exercise, is a prototypical example of such assessments. A key part of the REF is the peer review, by about 900 panel members, of a large number of research outputs (191,150 in the last REF) submitted by the assessed higher education institutions (HEIs).

The Metric Tide, a recent report commissioned by the Higher Education Funding Council for England, investigated the role of metrics in research assessment. Among other things, it analysed their potential role in the next REF and found that traditional metrics cannot supplant the peer review component of the REF. But, in a previous blog post, I argued that aggregating ratings from post-publication peer review, provided by the scientists who read the publications for their own research, leads to metrics that are superior to the traditional, citation-based ones.

Could these new metrics based on ratings replace the peer review part of the REF? I will argue that research assessment can be achieved better and at a much lower cost using rating-based indicators rather than the current organization of the REF. This probably generalizes to other national assessments of research.

A new proposed process

Let us assume that the REF organizers, instead of requesting HEIs to select and submit information about their staff and research outputs, request HEIs to ask their research-performing academic staff to publish a rating for each scientific publication that they read in its entirety and with great care, that has at least one UK-based author, and that has been published in the last 6 years (the period covered by the last REF). Also, let us assume that the number of staff in this category equals the number of staff submitted to the 2014 REF. According to the computation detailed in the Appendix below, this will lead, in 6 years, to about 113,450 articles with UK authors that will receive at least 3 ratings, 71,586 articles with 2 ratings and 156,187 articles with one rating. The number of articles with at least 2 ratings, 185,036, is 18% higher than the number of articles that were submitted as outputs to the REF, 157,021, and the analysis can be extrapolated proportionally to other types of research outputs.

Advantages of the process based on the aggregation of ratings

If two assessors read each output in the REF, then this is equivalent to the information provided by two independent ratings. According to the available official information, at least two assessors per output were used by Main Panel A, and two assessors per output were used by Sub-Panel 17; I could not find official information on this issue for other panels or subpanels. Assuming that, on average, two REF assessors read each output, the number of outputs for which the proposed process yields equivalent evaluative information is 18% higher than in the REF. Moreover, for a number of outputs equal to about 72% of those reviewed in the REF, the aggregation of ratings will yield at least 3 ratings instead of 2, i.e. at least 50% additional evaluative information per output.
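The two percentages in this comparison follow directly from the counts quoted earlier. A minimal check, using only figures from the text (the variable names are mine):

```python
# Counts quoted in the text (articles with UK authors, over 6 years).
at_least_3_ratings = 113_450
exactly_2_ratings = 71_586
ref_submitted_articles = 157_021  # articles submitted as outputs to the REF

# Articles with at least 2 ratings carry the same information as 2 assessors.
at_least_2_ratings = at_least_3_ratings + exactly_2_ratings  # 185,036

# Relative increase in the number of outputs with equivalent information.
extra_coverage = at_least_2_ratings / ref_submitted_articles - 1

# Articles with 3 or more ratings, relative to the REF article count.
share_with_3_plus = at_least_3_ratings / ref_submitted_articles

print(f"extra coverage: {extra_coverage:.0%}")
print(f"3+ ratings relative to REF articles: {share_with_3_plus:.0%}")
```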

Another source of extra evaluative information is the format of the ratings: while in the REF outputs are classified into 5 categories, ratings can be expressed on a scale of 100 percentile ranks, and the uncertainty of the reviewer is also collected in the form of an interval of percentile ranks.
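To make the rating format concrete, here is a minimal sketch of how one rating and a simple aggregate could be represented. The 1 to 100 range, the field names and the use of the median are my assumptions for illustration only; they are not a description of how Epistemio stores or aggregates ratings.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Rating:
    """One reviewer's rating of one publication (hypothetical structure)."""
    rank: int  # percentile rank, assumed here to run from 1 to 100
    low: int   # lower bound of the reviewer's uncertainty interval
    high: int  # upper bound of the reviewer's uncertainty interval

    def __post_init__(self):
        # The interval must contain the rank and stay within the scale.
        if not (1 <= self.low <= self.rank <= self.high <= 100):
            raise ValueError("expected 1 <= low <= rank <= high <= 100")


def aggregate(ratings):
    """Aggregate ratings with the median rank (an assumed, simple choice)."""
    return median(r.rank for r in ratings)


example = [Rating(80, 70, 90), Rating(90, 85, 95), Rating(70, 50, 80)]
print(aggregate(example))
```

The median is used here only because it is robust to a single outlying rating; a weighted scheme that uses the uncertainty intervals would be another option.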

The most important improvement brought by the aggregation of ratings is the involvement of many more scientists than the assessors used by the REF (e.g., more than 50,000 instead of 898 or 934). It has been previously argued that the REF panel members do not necessarily have sufficient expertise in the core field of the assessed outputs, and that “the mechanisms through which panelists are recruited are tailor-made for the sponsored replication of disciplinary elites”. In the proposed process, ratings will be given by the scientists who use the outputs for their own research needs, and they will therefore rate outputs that belong to their core field of expertise. There are also concerns about the bias of the REF against interdisciplinary research, which the proposed process will eliminate.

Therefore, the quantity and quality of the evaluative information regarding research outputs obtained by aggregating ratings will be much higher than that provided by the last REF.

Moreover, instead of the assessment exercise taking place once every 6 years, by using the new proposed process the evaluative information will be available in real time, as scientists read new publications. The allocation of funds could thus be adapted on a finer time scale (e.g., yearly) in response to the changes of research quality and importance in the considered time interval (e.g., the last 6 years).

Significant cost decreases

According to official estimates, the REF cost £246M. Out of this amount, £78M represented research outputs-related costs supported by the HEIs (including the costs of panelists, central management and coordination costs incurred in relation to research outputs, costs of reviewing / negotiating selection of staff and publications, and costs of validating / extending bibliographic records for submitted research outputs; see the Appendix). These costs will not be incurred with the proposed process, because no panelists will be needed to review publications and HEIs will neither have to select outputs nor incur costs for selecting staff. The rated publications are implicitly selected by scientists while they are looking for publications that are useful for their own work. The staff eligible to submit ratings could be selected broadly according to criteria that do not require any selection effort, e.g. according to their academic position within an HEI.

With the proposed process, the extra time that will be spent by scientists in providing the ratings over 6 years is estimated to cost less than £4M (see Appendix). Therefore, the net savings relative to REF will amount to more than £74M.

Towards an international process for aggregating ratings

If the ratings are shared publicly, through a platform such as Epistemio, then the same process of aggregating ratings could be used by funders from many countries. Processes similar to the REF are currently used, e.g., in Australia, France, Hong Kong, Italy, the Netherlands, New Zealand, Portugal, Romania, and Spain. If scientists agree to publish ratings not only of publications with authors from their country, but also of all recent publications that they read thoroughly, perhaps as a consequence of agreements between national organizers of assessment exercises, then the quantity of evaluative information will increase significantly. As already mentioned, if each scientist published one rating weekly, 52% of publications would get at least 10 ratings.

The aggregated ratings could be used not only for the allocation of institutional funding, but also for the assessment of individuals who apply for jobs or promotions. For example, 10 ratings for each of the 10 best papers of the assessed individual, given by 50-100 international experts in the core field of each publication, could provide much deeper and more accurate evaluative information than is typically available from the traditional 3 reference letters, or from a typical hiring committee of 5-10 members who may lack the time to read all 10 publications thoroughly and may not always have expertise in the core fields of the assessed publications, as is also the case for the REF panel members.

Other considerations

The calibration of the assessments within the panels was an important component of the REF assessment process. Such a calibration was needed because, e.g., categorizing one research output as “world leading” vs. “internationally excellent” is not obvious. The Epistemio rating scale uses the set of all publications read by the reviewer as a reference. Calibration between reviewers is implicitly achieved if the reviewers, through their training and research experience, have read a large sample of the relevant research in their fields that highly overlaps with the sample read by other experts in these fields. Automated normalization methods, such as the one used by the Computer Science and Informatics REF panel, may also be used.

The possibility of gaming is an important concern for such a process. Rings or cartels that engage in abnormal mutual exchanges of positive ratings can be detected automatically and be eliminated from the analysis after further investigation. Obvious conflicts of interest can be automatically detected given information about the present and former institutional affiliations of scientists and about their co-authorships. It is also the duty of scientists to conduct themselves ethically and refrain from rating publications in the case of conflicts of interest. The organizers of national assessment exercises could arrange to have the agreement of participating scientists to disclose their identity to these organizers, even though their anonymity to other parties will be preserved. In this case, the organizers could use typical processes for screening conflicts of interest, such as those used by funding agencies for the assessment of proposals. The use of ORCIDs and of institutional email addresses that can be linked to publications and institutional affiliations can prevent fraud by identity theft.
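As an illustration of the kind of automatic screening described above, here is a minimal sketch. The data layout, the function names and the threshold of 90 are all hypothetical; flagging reciprocal pairs is only the simplest possible heuristic, and real ring detection would need a more careful analysis followed by human investigation.

```python
def has_conflict(rater, authors, coauthors, affiliations):
    """Flag obvious conflicts: self-rating, co-authorship, shared institution.

    coauthors maps a person to the set of their co-authors; affiliations maps
    a person to the set of their present and former institutions.
    """
    for author in authors:
        if author == rater:
            return True
        if author in coauthors.get(rater, set()):
            return True
        if affiliations.get(rater, set()) & affiliations.get(author, set()):
            return True
    return False


def reciprocal_pairs(ratings, publications, threshold=90):
    """Return pairs of scientists who gave each other high ratings.

    ratings is a list of (rater, publication_id, percentile_rank) tuples;
    publications maps a publication id to the set of its authors.
    """
    praised = set()
    for rater, pub_id, rank in ratings:
        if rank >= threshold:
            for author in publications[pub_id]:
                if author != rater:
                    praised.add((rater, author))
    return {tuple(sorted(p)) for p in praised if (p[1], p[0]) in praised}


# Made-up example data.
publications = {"p1": {"alice"}, "p2": {"bob"}, "p3": {"carol"}}
ratings = [("bob", "p1", 95), ("alice", "p2", 97),
           ("dave", "p3", 92), ("carol", "p1", 60)]
coauthors = {"dave": {"carol"}}
affiliations = {"alice": {"Univ A"}, "bob": {"Univ B"}}

suspicious = reciprocal_pairs(ratings, publications)
dave_conflict = has_conflict("dave", publications["p3"], coauthors, affiliations)
bob_conflict = has_conflict("bob", publications["p1"], coauthors, affiliations)
print(suspicious, dave_conflict, bob_conflict)
```

Flagged pairs would only be candidates for the further investigation mentioned above, not proof of misconduct.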


Appendix

According to a study of a sample of US faculty members, a faculty member reads, on average, 252 scholarly articles per year. Out of these, 30.8% (at least 77) are read in their entirety and with great care. About 78% of these (at least 60) are articles published in the last 6 years. I will thus consider that each member of the UK research-performing academic staff can reliably rate at least 60 articles per year. This is a conservative extrapolation, because it does not include the articles that are read partly with great care (an extra 33.4% of readings).
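The figure of 60 ratable articles per year can be reproduced as follows. This is a minimal sketch that rounds down at each step, which is my reading of the "at least" wording:

```python
import math

articles_read_per_year = 252
share_read_with_great_care = 0.308    # read entirely and with great care
share_published_last_6_years = 0.78   # among the carefully read articles

# Round down at each step to keep the estimate conservative.
read_with_care = math.floor(articles_read_per_year * share_read_with_great_care)
recent_read_with_care = math.floor(read_with_care * share_published_last_6_years)

print(read_with_care, recent_read_with_care)
```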

I assume that the number of UK research-performing academic staff who will provide ratings equals the number of staff submitted to the 2014 REF, i.e. 52,061 (about 27% of the academic staff employed in the UK HEIs). I also assume that the share of publications with at least one UK author among the publications that they read equals the share of publications with at least one UK author among the world’s scientific publications. This is a conservative estimate, because UK staff may read UK publications more often than this share suggests, owing to their higher than average quality and their better local dissemination.

Adding the number of documents published in each journal listed by SCImagoJR in 2014 results in a total number of 2,351,806 documents, out of which 160,935 (6.84%) are attributed to the UK.

The number of ratings of articles from the last 6 years and with UK authors that can be provided by the UK research-performing academic staff, per year, is 52,061 x 60 x 6.84% ≈ 213,658. The average number of ratings per article with UK authors is 213,658 / 160,935 ≈ 1.33.
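The same arithmetic, written out step by step. The figures are those given above; the only choice of mine is to round the UK share to 6.84% before multiplying, as the text does:

```python
world_documents_2014 = 2_351_806  # documents listed by SCImagoJR for 2014
uk_documents_2014 = 160_935       # of which attributed to the UK
raters = 52_061                   # staff submitted to the 2014 REF
ratings_per_rater_per_year = 60

# UK share of the world's publications, rounded to 6.84% as in the text.
uk_share = round(uk_documents_2014 / world_documents_2014, 4)

# Ratings of UK-authored articles produced per year by all raters.
uk_ratings_per_year = raters * ratings_per_rater_per_year * uk_share

# Average number of ratings per UK-authored article.
ratings_per_uk_article = uk_ratings_per_year / uk_documents_2014

print(f"{uk_share:.2%}  {uk_ratings_per_year:,.0f}  {ratings_per_uk_article:.2f}")
```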

The ratings are distributed unevenly across articles: some are not read (and rated) at all, while some are read (and rated) multiple times. Considering that the distribution of readings across articles is similar to the distribution of citations, I used a previously published model to compute that about 12% of articles (i.e., 113,450 in 6 years) will get at least 3 ratings, about 7% (71,586) will get 2 ratings and about 16% (156,187) will get one rating.
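The model itself is not reproduced here, but the percentages can be checked against the counts. The sketch below assumes that the 2014 count of UK documents holds for each of the 6 years, which is my reading of how the counts and percentages relate:

```python
uk_articles_per_year = 160_935
years = 6
total_uk_articles = uk_articles_per_year * years  # 965,610 under this assumption

# Counts quoted in the text, keyed by number of ratings received.
counts = {"at least 3": 113_450, "exactly 2": 71_586, "exactly 1": 156_187}

shares = {label: n / total_uk_articles for label, n in counts.items()}
unrated_share = 1 - sum(shares.values())  # articles that receive no rating

for label, share in shares.items():
    print(f"{label}: {share:.0%}")
print(f"no rating: {unrated_share:.0%}")
```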

To estimate the time spent by scientists reading the articles that are rated, I assumed that the longest reading durations reported in the study of a sample of US faculty members correspond to the articles that are read in their entirety and with great care, i.e. those that will be rated. Then, 28% of articles (8.7% / 30.8%) would be rated after being read for more than one hour, 68% (20.8% / 30.8%) would be rated after being read for between half an hour and one hour, and the remaining 4% would be rated after being read for between 11 and 30 minutes. These estimates are similar to the estimate of less than one hour spent by one REF assessor reading one of the submitted research outputs, if each output was read by two assessors. It has been previously argued that this time spent per output is much less than the time spent by the reviewers who assess a manuscript prior to publication, and indeed, a pre-publication review of a paper takes, on average, 8.5 h (median 5 h) for a typical scientist, and 6.8 h for an active reviewer. However, this is probably because of the greater effort needed to devise improvements to the current form of the manuscript and to put them in writing, while, as described above, the average time needed to read an article thoroughly for the scientist’s own research needs is around one hour.
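The split of reading durations among the carefully read articles is a simple renormalization, sketched here with the percentages from the text:

```python
# Shares of all readings, as reported in the study cited in the text.
read_with_great_care = 30.8   # % of readings done entirely and with great care
over_one_hour = 8.7           # % of readings lasting more than one hour
half_to_one_hour = 20.8       # % of readings lasting 30 to 60 minutes

# Assumption from the text: the longest readings are the careful ones.
p_over_one_hour = over_one_hour / read_with_great_care
p_half_to_one_hour = half_to_one_hour / read_with_great_care
p_11_to_30_minutes = 1 - p_over_one_hour - p_half_to_one_hour

print(f"{p_over_one_hour:.0%} {p_half_to_one_hour:.0%} {p_11_to_30_minutes:.0%}")
```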

To estimate the cost of the extra time spent by scientists on providing ratings, I assume that one rating takes 5 extra minutes. Taking into account, as above, that a scientist can rate 60 articles per year, out of which 6.84% are from the UK; assuming that a full-time job comprises 1950 work hours per year; and overestimating the average yearly salary cost of reviewers as that of a senior-grade academic, £69,410, we get the total cost of providing ratings over 6 years as 60 x 6.84% x 5 / 60 / 1950 x 69,410 x 52,061 x 6 ≈ £3.80M.
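The cost formula above can be unpacked into its steps. All figures are from the text; the intermediate variable names are mine:

```python
ratings_per_year = 60          # articles a scientist can rate per year
uk_share = 0.0684              # share of those with UK authors
minutes_per_rating = 5         # assumed extra time per rating
work_hours_per_year = 1950     # full-time job
yearly_salary_gbp = 69_410     # senior-grade academic (an overestimate)
raters = 52_061
years = 6

# Extra hours per rater per year spent on rating UK-authored articles.
hours_per_rater_per_year = ratings_per_year * uk_share * minutes_per_rating / 60

# Salary cost of those hours for one rater in one year.
cost_per_rater_per_year = (
    hours_per_rater_per_year / work_hours_per_year * yearly_salary_gbp
)

total_cost_gbp = cost_per_rater_per_year * raters * years
print(f"£{total_cost_gbp / 1e6:.2f}M")
```

Under these assumptions each rater spends about a third of an hour per year on rating UK-authored articles, which is why the total stays small.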

According to data from the official REF Accountability Review, the research outputs-related costs of REF can be computed to amount to £78.04M, given that:

  • The cost of panelists (excluding costs related to impact assessment) was £19M;
  • Out of the £44M reported for central management and coordination costs (within the HEIs), an average of 40% is reported to be incurred in relation to research outputs, i.e. £17.60M;
  • Out of the £112M reported for costs at the unit-of-assessment level, excluding costs for impact statements and case studies, an average of 25% was spent on reviewing / negotiating selection of staff and publications, and 12% on validating / extending bibliographic records for submitted research outputs. The total is 37%, i.e. £41.44M.
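The three components add up as follows. This sketch uses the combined 37% share stated in the last item and the £3.80M rating cost estimated earlier in this Appendix (all amounts in £M):

```python
# Research outputs-related cost components of the REF, in £M.
panelists = 19.0
central_management = 0.40 * 44   # 40% of £44M
unit_of_assessment = 0.37 * 112  # combined 37% of £112M

ref_output_costs = panelists + central_management + unit_of_assessment

# Estimated cost of providing ratings over 6 years, from the text, in £M.
rating_cost = 3.80
net_savings = ref_output_costs - rating_cost

print(f"REF outputs-related costs: £{ref_output_costs:.2f}M")
print(f"net savings: £{net_savings:.2f}M")
```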