Aggregating ratings leads to peer-review-based metrics

Posted on December 9, 2015

Metrics (quantitative indicators) and peer review are often seen as two opposing paradigms in research assessment, for example in The Metric Tide, a recent report commissioned by the Higher Education Funding Council for England that investigates the role of metrics in research assessment. Peer review is considered the gold standard for research assessment. However, decision makers need quantitative indicators, because these provide the data required to allocate resources to research optimally and to establish whether those resources were spent efficiently.

In fact, peer review can lead to quantitative indicators that can be used by decision makers. For example, in the UK’s Research Excellence Framework (REF), peer review assigned research outputs to categories, and the percentage of outputs for each category defined a numerical indicator.

The REF, however, is costly, requires a complex organization, and produces results that are not optimal. A better way to define indicators based on expert judgment, which can substantiate decisions for allocating resources to research institutions and to scientists, is to aggregate ratings of the scientific publications that scientists read for the purpose of their own work.

On average, each scientist thoroughly reads about 77 scientific articles per year for his or her own research. However, the evaluative information that scientists could provide about these articles is currently lost. Aggregating, in an online database, ratings of the publications that scientists read would provide important information and could revolutionize the evaluation processes that support funding decisions. I have previously estimated that, if each scientist published one rating weekly and one post-publication peer review monthly, 52% of publications would get at least 10 ratings, and 46% of publications would get at least 3 reviews. The publications that would get the most ratings and reviews would be the ones that are most read by scientists during their typical research activities.

Online-aggregated ratings are now a major factor in the decisions made by consumers when choosing hotels, restaurants, movies and many other types of services or products. It is paradoxical that in science, a field for which peer review is a cornerstone, rating and reviewing publications on dedicated online platforms, after publication, is not yet a common behaviour.

To collect ratings of this kind, an appropriate rating scale must be defined. Online ratings typically take the form of a five-star or ten-star discrete scale: this standard has been adopted by major players such as Amazon, Yelp, TripAdvisor and IMDb, and also by the REF. However, scales of this type cannot accurately measure the quality and importance of scientific publications, because the distribution of these values across publications is highly skewed. As with the distributions of other scientometric indicators, the maximum value could be about 3 to 5 orders of magnitude larger than the median value. Therefore, a scale of 5, 10 or even 100 discrete categories cannot represent this variability well if the values that the scale represents vary linearly across categories. One solution is to ask experts to assess not the absolute value of quality and importance, but its percentile rank. Since raters should be able to express their uncertainty, the rating should be given as an interval of percentile ranks. I presented the resulting rating scale at this year’s International Society for Scientometrics and Informetrics Conference.
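
The skewness argument can be illustrated with a minimal Python sketch. The ten values below are made up for illustration (they are not real data), chosen so that the maximum is about three orders of magnitude above the median. The function names linear_category and percentile_rank are hypothetical, and the mid-rank convention used for the percentile rank is just one common choice.

```python
def linear_category(value, max_value, n_categories=5):
    """Map a value to one of n equally wide categories on a linear scale."""
    return min(int(value / max_value * n_categories), n_categories - 1)


def percentile_rank(value, population):
    """Percentile rank (0 to 100) of a value, counting ties as half."""
    below = sum(1 for v in population if v < value)
    equal = sum(1 for v in population if v == value)
    return 100 * (below + 0.5 * equal) / len(population)


# Hypothetical, highly skewed values of "quality and importance".
values = [1, 2, 3, 5, 8, 10, 20, 50, 100, 10000]

categories = [linear_category(v, max(values)) for v in values]
ranks = [percentile_rank(v, values) for v in values]

print(categories)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 4]
print(ranks)       # [5.0, 15.0, 25.0, 35.0, 45.0, 55.0, 65.0, 75.0, 85.0, 95.0]
```

On the linear five-category scale, nine of the ten publications collapse into the lowest category, whereas percentile ranks keep all ten distinguishable.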

This scale can be used on Epistemio to rate any scientific publication, thereby making metrics based on peer review a reality. Any scientist can publish an assessment of the publications that he or she has read lately in less than one minute, by going to the Epistemio website, searching for the publication, and adding a rating. About five extra minutes are needed once, to sign up at the first use of the website. Ratings and reviews can be either anonymous or signed, at their authors’ choice. Epistemio hosts these ratings and reviews free of charge and provides them under an open access licence. The copyright for reviews remains with the authors.

Ratings of one publication given by multiple experts can be aggregated into a distribution. Individual publications can be ranked according to their rating distributions. Distributions corresponding to multiple publications in a set can be aggregated into a distribution characterizing the set. Sets of publications (and, implicitly, the entities defining a set: scientists, units, institutions) can be ranked by directly using these distributions. The public results of the REF are similar distributions. Such usage is reasonable if each set includes the same number of top publications of an entity, relative to the entity size, and if differences between disciplines in the typical number of publications per scientist are taken into account. The last condition is implicitly fulfilled if rankings are performed within disciplines, as in the REF. One may also define a function mapping ratings to absolute values, in order to clarify, for example, the equivalence between several low-rated publications and a high-rated publication. In this case, selecting a number of top publications of an entity is not necessary. An example of such a function is the relative amount of funding allocated by the UK funding councils for the various REF output categories.
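
The aggregation steps above might be sketched in Python as follows. This is only an illustration under stated assumptions, not a description of how Epistemio or the REF computes anything: each rating is assumed to be an interval (low, high) of percentile ranks, its weight is spread uniformly over equal-width bins, distributions are combined by simple averaging, and the weights passed to weighted_value are placeholders rather than actual funding ratios. All function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (low, high) percentile ranks, 0 to 100


def rating_to_bins(rating: Interval, n_bins: int = 10) -> List[float]:
    """Spread one interval rating uniformly over equal-width percentile bins."""
    low, high = rating
    if not 0 <= low <= high <= 100:
        raise ValueError("expected 0 <= low <= high <= 100")
    width = 100 / n_bins
    if low == high:
        # A point rating puts all of its weight into a single bin.
        index = min(int(low // width), n_bins - 1)
        return [1.0 if i == index else 0.0 for i in range(n_bins)]
    shares = []
    for i in range(n_bins):
        overlap = min(high, (i + 1) * width) - max(low, i * width)
        shares.append(max(overlap, 0.0) / (high - low))
    return shares


def aggregate(distributions: List[List[float]]) -> List[float]:
    """Average several distributions bin by bin (assumed aggregation rule)."""
    return [sum(column) / len(distributions) for column in zip(*distributions)]


def publication_distribution(ratings: List[Interval], n_bins: int = 10) -> List[float]:
    """Distribution of one publication from the ratings of several experts."""
    return aggregate([rating_to_bins(r, n_bins) for r in ratings])


def weighted_value(distribution: List[float], weights: List[float]) -> float:
    """Map a distribution to a single value using per-bin weights."""
    return sum(p * w for p, w in zip(distribution, weights))


# Two experts rate one publication; four bins (quartiles) for readability.
paper = publication_distribution([(50, 100), (75, 100)], n_bins=4)
print(paper)                                # [0.0, 0.0, 0.25, 0.75]

# Placeholder weights that value the top quartile four times the third one.
print(weighted_value(paper, [0, 0, 1, 4]))  # 3.25
```

The same aggregate function can be applied to the distributions of several publications to obtain a distribution for a set, which corresponds to the set-level step described above.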

Rating-based indicators solve the most crucial problems typically associated with traditional bibliometrics:

  • The coverage of citation databases is uneven across disciplines and publication types; in particular, the coverage of arts and humanities is limited. This limits the applicability of citation-based indicators across disciplines and publication types. With rating-based indicators, any type of publication from any field, including arts and humanities, can be assessed on an equal footing. Here, “publication” refers to any type of research output that is publicly available and that can be uniquely identified through a reference, including journal articles, conference papers, book chapters, books, reports, preprints, patents, datasets, software, videos, sounds, recordings of exhibitions and performances, digital or digitized artefacts, and so on.
  • Citation-based indicators, when used across fields, need to be field-normalized, with all the associated problems of defining fields. Such normalization is not needed for rating-based indicators if the sets associated with rated entities have an equally small number of publications. There is no bias against interdisciplinary work.
  • Citations need time to accumulate: the first citations of a new publication appear after a full publication cycle, which may take several months and up to one or two years. The first ratings of a new publication may appear much faster, within a few days or weeks of publication, as soon as the publication has been read by specialists in the field.

In a follow-up post, I argue that research outputs can be assessed better and at a much lower cost by using rating-based indicators rather than the current organization of the REF.