Replacing the REF assessment of research outputs by a process of collecting peer-provided ratings

Posted by Răzvan Valentin Florian on December 10, 2015

There are several countries where the performance of research institutions is assessed periodically, typically for allocating institutional funding. The UK’s Research Excellence Framework (REF), formerly named Research Assessment Exercise, is a prototypical example of such assessments. A key part of REF is the peer review, by about 900 panel members, of a large number of research outputs (191,150 in the last REF) submitted by the assessed higher education institutions (HEIs).

The Metric Tide, a recent report commissioned by the Higher Education Funding Council for England to investigate the role of metrics in research assessment, analysed, among other things, their potential role in the next REF and found that traditional metrics cannot supplant the peer review component of the REF. However, in a previous blog post, I argued that aggregating ratings from post-publication peer review, provided by the scientists who read the publications for their own research, leads to metrics that are superior to the traditional, citation-based ones.

Could these new metrics based on ratings replace the peer review part of the REF? I will argue that research assessment can be achieved better and at a much lower cost using rating-based indicators rather than the current organization of the REF. This probably generalizes to other national assessments of research.

A new proposed process

Let us assume that the REF organizers, instead of requesting HEIs to select and submit information about their staff and research outputs, request HEIs to ask their research-performing academic staff to publish a rating for each scientific publication that they read in its entirety and with great care, that has at least one UK-based author, and that has been published in the last 6 years (the period covered by the last REF). Also, let us assume that the number of staff in this category equals the number of staff submitted to the 2014 REF. According to the computation detailed in the Appendix below, this will lead, in 6 years, to about 113,450 articles with UK authors that will receive at least 3 ratings, 71,586 articles with 2 ratings and 156,187 articles with one rating. The number of articles with at least 2 ratings, 185,036, is 18% higher than the number of articles that were submitted as outputs to REF, 157,021, and the analysis can be extrapolated proportionally to other types of research outputs.

Advantages of the process based on the aggregation of ratings

If two assessors read each output in the REF, then this is equivalent to the information provided by two independent ratings. According to the available official information, at least two assessors per output were used by Main Panel A, and two assessors per output were used by Sub-Panel 17; I could not find official information regarding this issue for other panels or subpanels. Assuming that, on average, two REF assessors read each output, the number of outputs for which the new proposed process of aggregating ratings yields equivalent evaluative information is 18% higher than in REF. For a number of outputs equal to about 72% of those reviewed in the REF, the aggregation of ratings will lead to at least 3 ratings instead of 2, i.e. at least 50% additional evaluative information per output.
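The two comparisons above follow directly from the counts quoted in this post. A minimal check, using only those figures (the variable names are illustrative):

```python
# Counts quoted in the post: projected ratings over 6 years vs. articles submitted to REF 2014.
articles_with_at_least_3_ratings = 113_450
articles_with_2_ratings = 71_586
ref_article_outputs = 157_021

# Articles with at least 2 ratings carry at least as much information as 2 REF assessors.
articles_with_at_least_2_ratings = articles_with_at_least_3_ratings + articles_with_2_ratings

extra_coverage = articles_with_at_least_2_ratings / ref_article_outputs - 1
share_with_3_or_more = articles_with_at_least_3_ratings / ref_article_outputs

print(articles_with_at_least_2_ratings)           # 185036
print(f"extra coverage: {extra_coverage:.0%}")    # about 18%
print(f"3+ ratings vs REF outputs: {share_with_3_or_more:.0%}")  # about 72%
```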

Another source of extra evaluative information is the format of the ratings: while in the REF outputs are classified into 5 categories, ratings can be expressed on a scale of 100 percentile ranks, and the uncertainty of the reviewer is also collected in the form of an interval of percentile ranks.

The most important improvement brought by the aggregation of ratings is the involvement of many more scientists than the assessors used by REF (e.g., more than 50,000 instead of 898 or 934). It has been previously argued that the REF panel members do not necessarily have sufficient expertise in the core field of the assessed outputs, and that “the mechanisms through which panelists are recruited are tailor-made for the sponsored replication of disciplinary elites”. In the case of the proposed process, ratings will be given by the scientists who use the outputs for their own research needs, and therefore they will provide ratings for outputs that belong to their core field of expertise. There are also concerns about the bias of REF against interdisciplinarity, which the proposed process will eliminate.

Therefore, the quantity and quality of the evaluative information regarding research outputs obtained by this method of aggregating ratings will be much higher than that provided by the last REF.

Moreover, instead of the assessment exercise taking place once every 6 years, by using the new proposed process the evaluative information will be available in real time, as scientists read new publications. The allocation of funds could thus be adapted on a finer time scale (e.g., yearly) in response to the changes of research quality and importance in the considered time interval (e.g., the last 6 years).

Significant cost decreases

According to official estimates, REF cost £246M. Out of this amount, £78M represented research outputs-related costs supported by the HEIs (including the costs of panelists, central management and coordination costs incurred in relation to research outputs, costs of reviewing / negotiating selection of staff and publications, and costs of validating / extending bibliographic records for submitted research outputs; see the Appendix). These costs will not be incurred if using the proposed process, because no panelists will be needed to review publications and HEIs will neither have to select outputs nor incur costs for selecting staff. The rated publications are implicitly selected by scientists while they are looking for publications that are useful for their own work. The staff eligible to submit ratings could be selected broadly according to criteria that do not require any selection effort, e.g. according to their academic position within an HEI.

With the proposed process, the extra time that will be spent by scientists in providing the ratings over 6 years is estimated to cost less than £4M (see Appendix). Therefore, the net savings relative to REF will amount to more than £74M.

Towards an international process for aggregating ratings

If the ratings are shared publicly, through a platform such as Epistemio, then the same process of aggregating ratings could be used by funders from many countries. Processes similar to the REF are currently used, e.g., in Australia, France, Hong Kong, Italy, the Netherlands, New Zealand, Portugal, Romania, and Spain. If scientists agree to publish ratings not only of publications with authors from their country, but also of all recent publications that they read thoroughly, perhaps as a consequence of agreements between national organizers of assessment exercises, then the quantity of evaluative information will increase significantly. As I have previously estimated, if each scientist published one rating weekly, 52% of publications would get at least 10 ratings.

The aggregated ratings could be used not only for the allocation of institutional funding, but also for the assessment of individuals who apply for jobs or promotions. For example, 10 ratings for each of the 10 best papers of the assessed individual, given by 50-100 international experts in the core field of each publication, could provide much deeper and more accurate evaluative information than is typically available from the traditional 3 reference letters, or from a typical hiring committee of 5-10 members who may lack the time to read all 10 publications thoroughly and who, like the REF panel members, may not always have expertise in the core fields of the assessed publications.

Other considerations

The calibration of the assessments within the panels was an important component of the REF assessment process. Such a calibration was needed because, e.g., categorizing one research output as “world leading” vs. “internationally excellent” is not obvious. The Epistemio rating scale uses the set of all publications read by the reviewer as a reference. Calibration between reviewers is implicitly achieved if the reviewers, through their training and research experience, have read a large sample of the relevant research in their fields that highly overlaps with the sample read by other experts in these fields. Automated normalization methods, such as the one used by the Computer Science and Informatics REF panel, may also be used.

The possibility of gaming is an important concern for such a process. Rings or cartels that engage in abnormal mutual exchanges of positive ratings can be detected automatically and be eliminated from the analysis after further investigation. Obvious conflicts of interest can be automatically detected given information about the present and former institutional affiliations of scientists and about their co-authorships. It is also the duty of scientists to conduct themselves ethically and refrain from rating publications in the case of conflicts of interest. The organizers of national assessment exercises could arrange to have the agreement of participating scientists to disclose their identity to these organizers, even though their anonymity to other parties will be preserved. In this case, the organizers could use typical processes for screening conflicts of interest, such as those used by funding agencies for the assessment of proposals. The use of ORCIDs and of institutional email addresses that can be linked to publications and institutional affiliations can prevent fraud by identity theft.
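As an illustration of how abnormal mutual exchanges of positive ratings might be flagged automatically, here is a minimal sketch. It is not Epistemio's method: the function name `reciprocal_pairs`, the input format and both thresholds are assumptions chosen for the example, and a flagged pair would only be a candidate for the further investigation mentioned above.

```python
from collections import Counter

def reciprocal_pairs(ratings, min_score=90, min_each_way=2):
    """Flag pairs of scientists who repeatedly give each other's work high ratings.

    ratings: iterable of (rater, author, score) tuples, where score is a
    percentile rank in [0, 100]. Thresholds are arbitrary illustrative values.
    """
    high = Counter()
    for rater, author, score in ratings:
        if rater != author and score >= min_score:
            high[(rater, author)] += 1

    flagged = set()
    for (a, b), count in high.items():
        # A pair is flagged only if high ratings flow in both directions.
        if count >= min_each_way and high.get((b, a), 0) >= min_each_way:
            flagged.add(frozenset((a, b)))
    return flagged

# Hypothetical example data.
example = [
    ("alice", "bob", 95), ("alice", "bob", 92),
    ("bob", "alice", 97), ("bob", "alice", 91),
    ("carol", "alice", 95),
]
print(reciprocal_pairs(example))
```

A real system would also need to account for small fields, where a few experts legitimately read and rate each other's work, which is why the output is a list for review rather than an automatic exclusion.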

Appendix

According to a study of a sample of US faculty members, a faculty member reads, on average, 252 scholarly articles per year. Out of these, 30.8% (at least 77) are read in their entirety and with great care. About 78% of these (at least 60) are articles published in the last 6 years. I will thus consider that each member of the UK research-performing academic staff can reliably rate at least 60 articles per year. This is a conservative extrapolation, because it does not include the articles that are read partly with great care (an extra 33.4% of readings).
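The reading estimate above can be reproduced in a few lines. The numbers come from the paragraph; the variable names are illustrative.

```python
import math

articles_read_per_year = 252        # average readings per faculty member, from the cited study
share_read_with_great_care = 0.308  # read in their entirety and with great care
share_from_last_6_years = 0.78      # of those, published in the last 6 years

read_carefully = articles_read_per_year * share_read_with_great_care
recent_and_read_carefully = read_carefully * share_from_last_6_years

# Rounding down gives the "at least" figures used in the text.
print(math.floor(read_carefully))             # 77
print(math.floor(recent_and_read_carefully))  # 60
```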

I assume that the number of UK research-performing academic staff who will provide ratings equals the number of staff submitted to the 2014 REF, i.e. 52,061 (about 27% of the academic staff employed in UK HEIs). I also assume that the share of publications with at least one UK author among the publications that they read equals the share of publications with at least one UK author among the world’s scientific publications. This is a conservative estimate, because UK publications could be read more often by UK staff, owing to their higher-than-average quality and to the better local dissemination of local publications.

Adding the number of documents published in each journal listed by SCImagoJR in 2014 results in a total number of 2,351,806 documents, out of which 160,935 (6.84%) are attributed to the UK.

The number of ratings of articles from the last 6 years and with UK authors that can be provided by the UK research-performing academic staff, per year, is 52,061 x 60 x 6.84% ≈ 213,658. The average number of ratings per article with UK authors is 213,658 / 160,935 ≈ 1.33.
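The same arithmetic as a short script, using only the figures given above (variable names are illustrative):

```python
staff = 52_061                     # staff submitted to the 2014 REF
ratable_articles_per_year = 60     # conservative estimate per scientist
world_documents_2014 = 2_351_806   # documents listed by SCImagoJR for 2014
uk_documents_2014 = 160_935        # of which attributed to the UK

uk_share = round(uk_documents_2014 / world_documents_2014, 4)  # 0.0684, i.e. 6.84%

ratings_per_year = staff * ratable_articles_per_year * uk_share
average_ratings_per_uk_article = ratings_per_year / uk_documents_2014

print(round(ratings_per_year))                   # 213658
print(round(average_ratings_per_uk_article, 2))  # 1.33
```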

The ratings are distributed unevenly across articles: some are not read (and rated) at all, while some are read (and rated) multiple times. Considering that the distribution of readings across articles is similar to the distribution of citations, I used a previously published model to compute that about 12% of articles (i.e., 113,450 in 6 years) will get at least 3 ratings, about 7% (71,586) will get 2 ratings and about 16% (156,187) will get one rating.
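The percentages above are consistent with the quoted counts if the base is 6 years of UK output at the 2014 level (6 x 160,935 articles); that constant-output base is an assumption made for this check, not something stated explicitly. The sketch below only verifies the shares; it does not reproduce the previously published model that generated the counts.

```python
uk_articles_per_year = 160_935
years = 6
# Assumption: UK output stays at the 2014 level for all 6 years.
total_uk_articles = uk_articles_per_year * years

counts = {
    "at least 3 ratings": 113_450,
    "2 ratings": 71_586,
    "1 rating": 156_187,
}

shares = {label: count / total_uk_articles for label, count in counts.items()}
# Whatever is left over would receive no rating at all under this base.
unrated_share = 1 - sum(shares.values())

for label, share in shares.items():
    print(f"{label}: {share:.0%}")
print(f"no rating: {unrated_share:.0%}")
```

Under this assumption, roughly two thirds of UK articles would receive no rating from UK staff alone, which illustrates how uneven the assumed distribution of readings is.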

To estimate the time spent by scientists reading the articles that are rated, I assumed that the longest reading durations reported in the study of a sample of US faculty members correspond to the articles that are read in their entirety and with great care, i.e. those that will be rated. Then, 28% of articles (8.7% / 30.8%) would be rated after being read for more than one hour, 68% (20.8% / 30.8%) would be rated after being read for between half an hour and one hour, and the remaining 4% would be rated after being read for between 11 and 30 minutes. These estimates are similar to the estimate of less than one hour spent by one REF assessor reading one of the submitted research outputs, if each output was read by two assessors. It has been previously argued that this time spent per output is much less than that spent by the reviewers who assess a manuscript prior to publication, and indeed, a pre-publication review of a paper takes, on average, 8.5 h (median 5 h) for a typical scientist, and 6.8 h for an active reviewer. However, this is probably because of the greater effort needed for devising improvements to the current form of the manuscript and putting them in writing, while, as described above, the average time needed to read an article thoroughly for the scientist’s own research needs is around one hour.

To estimate the cost of providing the ratings, I assume that each rating takes 5 extra minutes beyond the time already spent reading. Taking into account, as above, that a scientist can rate 60 articles per year, out of which 6.84% are from the UK; assuming that a full-time job comprises 1950 work hours per year; and overestimating the average yearly salary cost of reviewers as that of a senior-grade academic, £69,410, we get the total cost of providing ratings over 6 years as 60 x 6.84% x 5 / 60 / 1950 x 69,410 x 52,061 x 6 ≈ £3.80M.
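The cost formula above, written out step by step so that each factor is visible. All inputs are the figures from the paragraph; the variable names are illustrative.

```python
ratable_articles_per_year = 60
uk_share = 0.0684                 # share of read articles that have a UK author
minutes_per_rating = 5            # assumed extra time per rating
work_hours_per_year = 1950
yearly_salary_gbp = 69_410        # senior-grade academic, a deliberate overestimate
staff = 52_061
years = 6

uk_ratings_per_scientist_per_year = ratable_articles_per_year * uk_share
hours_per_scientist_per_year = uk_ratings_per_scientist_per_year * minutes_per_rating / 60
cost_per_scientist_per_year = hours_per_scientist_per_year / work_hours_per_year * yearly_salary_gbp

total_cost_gbp = cost_per_scientist_per_year * staff * years
print(f"£{total_cost_gbp / 1e6:.2f}M")  # £3.80M
```

Per scientist, this amounts to about 20 minutes and roughly £12 per year.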

According to data from the official REF Accountability Review, the research outputs-related costs of REF can be computed to amount to £78.04M, given that:

The cost of panelists (excluding costs related to impact assessment) was £19M;
Out of the £44M reported for central management and coordination costs (within the HEIs), an average of 40% is reported to be incurred in relation to research outputs, i.e. £17.60M;
Out of the £112M reported for costs at the unit-of-assessment level, excluding costs for impact statements and case studies, an average of 25% was spent on reviewing / negotiating selection of staff and publications, and 12% on validating / extending bibliographic records for submitted research outputs. The total is 37%, i.e. £41.44M.
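The breakdown above, together with the rating cost estimated earlier in this Appendix, gives the net savings quoted in the main text. A minimal check in millions of pounds, using only figures from this post:

```python
panelist_cost = 19.0                 # excluding impact assessment
central_management_cost = 44.0
central_share_for_outputs = 0.40
unit_level_cost = 112.0              # excluding impact statements and case studies
unit_share_for_outputs = 0.37        # total share quoted for staff/output selection and bibliographic records

outputs_related_cost = (
    panelist_cost
    + central_management_cost * central_share_for_outputs
    + unit_level_cost * unit_share_for_outputs
)

rating_cost = 3.80                   # estimated cost of providing ratings over 6 years
net_savings = outputs_related_cost - rating_cost

print(f"outputs-related REF cost: £{outputs_related_cost:.2f}M")  # £78.04M
print(f"net savings: £{net_savings:.2f}M")                        # £74.24M
```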

Aggregating ratings leads to peer-review-based metrics

Posted by Răzvan Valentin Florian on December 9, 2015

Metrics (quantitative indicators) and peer review are often seen as two opposing paradigms used in research assessment, e.g. in The Metric Tide, a recent report commissioned by the Higher Education Funding Council for England to investigate the role of metrics in research assessment. Peer review is considered the gold standard for research assessment. However, decision makers need quantitative indicators because these provide the data required for optimal allocation of resources to research and for establishing whether these resources were spent efficiently.

In fact, peer review can lead to quantitative indicators that can be used by decision makers. For example, in the UK’s Research Excellence Framework (REF), peer review assigned research outputs to categories, and the percentage of outputs for each category defined a numerical indicator.

REF, however, is costly, requires a complex organization, and its results are not optimal. A better way to define indicators based on expert judgment, which substantiates decisions for allocating resources to research institutions and to scientists, is by aggregating ratings of the scientific publications that are read by scientists for the purpose of their own work.

Every scientist reads thoroughly an average of about 77 scientific articles per year for his or her own research. However, the evaluative information they can provide about these articles is currently lost. Aggregating in an online database ratings of the publications that scientists read provides important information and can revolutionize the evaluation processes that support funding decisions. I have previously estimated that, if each scientist would publish one rating weekly and one post-publication peer review monthly, 52% of publications would get at least 10 ratings, and 46% of publications would get at least 3 reviews. The publications that would get the most ratings and reviews would be the ones that are most read by scientists during their typical research activities.

Online-aggregated ratings are now a major factor in the decisions made by consumers when choosing hotels, restaurants, movies and many other types of services or products. It is paradoxical that in science, a field for which peer review is a cornerstone, rating and reviewing publications on dedicated online platforms, after publication, is not yet a common behaviour.

To collect ratings of this kind, an appropriate rating scale should be defined. Online ratings typically take the form of a five-star or ten-star discrete scale: this standard has been adopted by major players such as Amazon, Yelp, TripAdvisor and IMDb, and also by the REF. However, these types of scales are not able to accurately measure the quality and importance of scientific publications, due to the high skewness of the distribution of these values across publications. Similarly to the distributions of other scientometric indicators, the maximum value can be about 3 to 5 orders of magnitude larger than the median value. Therefore, a scale of 5, 10 or even 100 discrete categories cannot represent this variability well if the values that the scale represents vary linearly across categories. A solution to this conundrum is to ask experts to assess not the absolute value of quality and importance, but its percentile rank. Since raters should be able to express their uncertainty, the rating should be given as an interval of percentile ranks. I have presented the resulting rating scale at this year’s International Society of Scientometrics and Informetrics Conference.
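A rating on such a scale can be represented as a pair of percentile ranks. The sketch below is only an illustration: the class name `PercentileRating` and its fields are hypothetical and do not describe Epistemio's actual data model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PercentileRating:
    """A rating given as an interval of percentile ranks, both within [0, 100]."""
    lower: float
    upper: float

    def __post_init__(self):
        if not (0 <= self.lower <= self.upper <= 100):
            raise ValueError("expected 0 <= lower <= upper <= 100")

    @property
    def midpoint(self) -> float:
        """Central estimate of the publication's percentile rank."""
        return (self.lower + self.upper) / 2

    @property
    def uncertainty(self) -> float:
        """Width of the interval; a wider interval means a less certain rater."""
        return self.upper - self.lower

# A rater who places a publication somewhere between the 80th and 95th percentile
# of all the publications they have read.
rating = PercentileRating(lower=80, upper=95)
print(rating.midpoint, rating.uncertainty)  # 87.5 15
```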

This scale can be used for rating any scientific publication on Epistemio, thereby making metrics based on peer review a reality. Any scientist can publish an assessment of a publication that she / he has read lately in less than one minute, by going to epistemio.com, searching for the publication, and adding a rating. About five extra minutes are needed once to sign up, at the first use of the website. Ratings and reviews can be either anonymous or signed, according to the authors’ choice. Epistemio hosts these ratings and reviews for free and provides them under an open access licence. The copyright for reviews remains with the authors.

Ratings of one publication given by multiple experts can be aggregated in a distribution. Individual publications can be ranked according to their rating distributions. Distributions corresponding to multiple publications in a set can be aggregated in a distribution characterizing the set. Sets of publications (and, implicitly, the entities defining a set: scientists, units, institutions) can be ranked by directly using these distributions. The public results of REF are similar distributions. Such usage is reasonable if each set includes the same number of top publications of an entity, relative to the entity size, and differences between the typical numbers of publications per scientist among disciplines are taken into account. The last condition is implicitly fulfilled if rankings are performed within disciplines, as in the REF. One may also define a function mapping ratings to absolute values, in order to clarify, e.g., the equivalence between several low-rated publications and a high-rated publication. In this case, selecting a number of top publications of an entity is not necessary. An example of such a function is the relative amount of funding allocated by the UK funding councils for the various REF output categories.
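The two aggregation steps described above (ratings into a publication's distribution, publications into a set's distribution) could be sketched as follows. The post does not specify how intervals are combined, so this sketch assumes that each rating spreads its weight uniformly over its percentile interval, that bins are deciles, and that distributions are averaged with equal weights; the function names are hypothetical.

```python
def rating_to_histogram(lower, upper, n_bins=10):
    """Spread one rating's unit weight uniformly over [lower, upper] across equal-width bins."""
    width = 100 / n_bins
    hist = [0.0] * n_bins
    if upper == lower:
        # A point rating puts all its weight in a single bin.
        hist[min(int(lower // width), n_bins - 1)] = 1.0
        return hist
    for i in range(n_bins):
        bin_lo, bin_hi = i * width, (i + 1) * width
        overlap = max(0.0, min(bin_hi, upper) - max(bin_lo, lower))
        hist[i] = overlap / (upper - lower)
    return hist

def average_histograms(histograms):
    """Equal-weight average of several distributions over the same bins."""
    n = len(histograms)
    return [sum(column) / n for column in zip(*histograms)]

# Step 1: aggregate the ratings of each publication (hypothetical intervals).
publication_a = average_histograms([rating_to_histogram(80, 100), rating_to_histogram(90, 100)])
publication_b = average_histograms([rating_to_histogram(40, 60)])

# Step 2: aggregate the publications of a set (e.g. one scientist or unit).
unit_distribution = average_histograms([publication_a, publication_b])
print(publication_a)
print(unit_distribution)
```

Ranking entities then requires a rule for comparing distributions, or a mapping from ratings to absolute values as discussed above; this sketch leaves that choice open.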

Rating-based indicators solve the most crucial problems typically associated with traditional bibliometrics:

The coverage of citation databases is uneven across disciplines and publication types; in particular, the coverage of arts and humanities is limited. This limits the applicability of citation-based indicators across disciplines and publication types. With rating-based indicators, any type of publication from any field, including arts and humanities, can be assessed on equal grounds. Here, “publication” refers to any type of research output that is publicly available and that can be uniquely identified through a reference, including journal articles, conference papers, book chapters, books, reports, preprints, patents, datasets, software, videos, sounds, recordings of exhibitions and performances, digital or digitalized artefacts, and so on.
Citation-based indicators, when used across fields, need to be field-normalized, with all the associated problems of defining fields. Such normalization is not needed for rating-based indicators if the sets associated with rated entities have an equally small number of publications. There is no bias against interdisciplinary work.
Citations need time to accumulate: the first citations of a new publication appear after a full publication cycle that may take quite some months, up to one or two years. The first ratings of a new publication may appear much faster, in a few days or weeks after it has been published, immediately after the publication is read by specialists in the field.

In a following post, I argue that assessment of research outputs can be achieved better and at a much lower cost using rating-based indicators rather than the current organization of the REF.

A code of conduct for post-publication peer review

Posted by Răzvan Valentin Florian on December 6, 2015

At Epistemio, we believe that post-publication peer review will take an increasingly important role in research assessment. For example, aggregating ratings will lead to peer-review-based metrics of quality and importance of individual publications, eliminating the problems of current indicators highlighted, for example, in the San Francisco Declaration on Research Assessment (DORA).

Post-publication peer review will achieve its potential only if it is performed responsibly and ethically. While there are various codes of conduct for traditional pre-publication peer review, there have been no clear guidelines for post-publication peer review.

We have recently developed a code of conduct for post-publication peer review, by adapting the Ethical Guidelines for Peer Reviewers of the Committee on Publication Ethics (COPE). This code of conduct has already been included in our recently updated Terms of Use, which should be observed by all of our users, including those who post ratings and reviews of the publications they read on Epistemio.

Here is this code of conduct:

Scientists should publish ratings or reviews of a publication only if all of the following apply:

they have the subject expertise required to carry out a proper assessment of the publication;

they do not have any conflict of interest;

they have read the publication thoroughly and with great care.

Situations of conflict of interest include, but are not limited to, any of the following:

working at the same institution as any of the authors of the publication (or planning to join that institution or to apply for a job there);

having been recent (e.g. within the past 3 years) mentors, mentees, close collaborators or joint grant holders with the authors of the publication;

having a close personal relationship with any of the authors of the publication.

Additionally, all of the following should be observed:

the assessment should be based on the merits of the publication and not be influenced, either positively or negatively, by its origins, by the nationality, religious or political beliefs, gender or other characteristics of the authors, by commercial considerations, by any personal, financial, or other conflicting considerations or by intellectual biases;

the assessment should be honest, fair, and reflect the reviewer’s own views;

the review should be objective and constructive;

the reviewer should refrain from being hostile or inflammatory, from making libelous or derogatory personal comments, and from making unfounded accusations or criticisms;

the reviewer should be specific in her/his criticisms, and provide evidence with appropriate references to substantiate critical statements;

the reviewer should be aware of the sensitivities surrounding language issues that are due to the authors writing in a language that is not their own, and phrase the feedback appropriately and with due respect;

if the review or comment is anonymous, the reviewer should not write it in a way that suggests that it has been written by another identifiable person.

Publishing on Epistemio a rating of a scientific publication that you have read lately takes no more than one minute. Here is how you can do it:

Log in or sign up;
Search for the publication you would like to rate, for example by typing its title;
Add the rating;
Optionally, add a review that supports your rating.
