
Imagine that we have some non-trivial subsystem – e.g. product retrieval for a user query – and need to know how good it is and whether it is the bottleneck for the overall system. Or we have 2 competing solutions and need to choose the better one.

It is generally accepted that to make such decisions, we need to measure the subsystem; to do the measurement, in turn, we need ground truth data – in most cases, labeled by humans or crowdsourced labelers. Depending on our main goal and task specifics, one labeling setup may be better than another.

This post discusses ground truth labeling setups for ranking and helps you choose the most appropriate one for your use case. By ranking we mean the task of establishing an order on multiple items. It is common in many applications, such as search engines (for products/images/webpages) or recommendation systems, where items are ranked by relevance to a user query or a user profile. Ranking can also be based on universal properties of the items themselves, e.g. ranking images by attractiveness.


Setups of ground truth data labeling

One can distinguish three main setups for ground truth data labeling:

  1. Absolute Gain Estimation – each item is labeled independently of the others on an established absolute scale (usually from 0 to 1); note that in the case of relevance-based ranking, an item includes both constituents, e.g. a pair of a user query and an image to be ranked.
  2. Relative Ordering – a labeler sees all items and directly introduces an order on them.
  3. Side-by-Side – a labeler compares just 2 items at a time and provides a label reflecting the result of the comparison; the label may be binary or multivalued.

For example, if all we need is to compare 2 subsystems and it is relatively easy to tell which of two items is better, then Side-by-Side is the way to go. If, instead, we want to know how far the current ranking subsystem is from the ideal one, then Absolute Gain Estimation may be a better choice.

See a detailed comparison of these setups in the following table:

| | Absolute Gain Estimation (AGE) | Relative Ordering (RO) | Side-by-Side (SBS) |
| --- | --- | --- | --- |
| Task | Given just one item (per query), return a number reflecting the quality of this item | Introduce an order on a collection of items (per query) | Compare 2 items (per query) and decide which one is better (possibly with a scale of how much better) |
| Where it can be used (tasks) | 1. Estimate how far we are from the ideal<br>2. Estimate the priority of the task<br>3. Estimate ROI<br>4. Estimate not only the quality of the ranking subsystem but also the quality of the items to be ranked (say, if all items score around 0.1, then even perfect ranking would not improve overall quality, and better items need to be added to the system)<br>5. Training data collection<br>6. Compare any number of systems | 1. Training data collection<br>2. Compare any number of systems | 1. Compare Prod vs. New (to decide whether to ship a new model)<br>2. Compare the system with a competitor<br>3. Train gains for levels (by Maximum Likelihood estimation) |
| How it can be implemented (see details in the section below) | 1. Predefined levels (Excellent/Good/Fair/Bad)<br>2. Slider-like | 1. Best-worst scaling<br>2. Direct swapping of items until the needed order is achieved | Show two items and choose one of the predefined levels (e.g. on a Likert scale or a 1-100 slider) |
| Possible metrics | MRR (binary) | Rank correlation coefficients (e.g. Kendall's tau) | Win/loss ratios (possibly weighted): (wins - losses) / (wins + losses + ties); McNemar's test |
| Main pros | Easy to combine with other measurements | Most similar to actual ranking, thus best for the corresponding tasks (low overhead, high speed, etc.) | Most similar to an actual comparison, thus best for the corresponding tasks (high sensitivity, low overhead, high speed, etc.); can include a scale (e.g. Likert) |
| Main cons | Hard to define and judge: the Ideal and the Worst item must be described or imagined for each query, resulting in worse sensitivity for training and comparison | Cannot differentiate 2-item scenarios such as gains 1 vs. 0.1 and 0.6 vs. 0.5; can hardly support labeling of a large number of items (say, greater than 10) | By design, requires 2 items to compare |
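The listed metrics are straightforward to compute. A minimal Python sketch (function names are mine; `mrr` assumes binary relevance labels in ranked order, and `kendall_tau` is the simple O(n²) pairwise version without tie handling):

```python
from itertools import combinations

def mrr(relevance_lists):
    """Mean Reciprocal Rank over queries; each list holds binary
    relevance labels in ranked order (AGE-style binary labels)."""
    total = 0.0
    for labels in relevance_lists:
        for pos, rel in enumerate(labels, start=1):
            if rel:
                total += 1.0 / pos
                break
    return total / len(relevance_lists)

def kendall_tau(order_a, order_b):
    """Kendall rank correlation between two orderings of the same items
    (RO-style labels); +1 means identical order, -1 means reversed."""
    pos_a = {item: i for i, item in enumerate(order_a)}
    pos_b = {item: i for i, item in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        sign = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(order_a) * (len(order_a) - 1) / 2
    return (concordant - discordant) / n_pairs

def sbs_score(wins, losses, ties):
    """Side-by-Side win/loss ratio: (wins - losses) / (wins + losses + ties)."""
    return (wins - losses) / (wins + losses + ties)
```

For example, `kendall_tau(["a", "b", "c"], ["a", "c", "b"])` returns 1/3: of the three item pairs, two keep their relative order and one is swapped.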

There may also be hybrids of these setups, e.g.:

  1. RO + AGE: first do Relative Ordering, then put all items on a scale by doing the costly AGE only for the best and the worst items of the query.
  2. SBS + AGE: do Side-by-Side against items with known absolute gains (e.g. 0.9, 0.5, 0.1) from another query (harder to compare, especially the medium cases).

Note that we can use different setups for different tasks, e.g. AGE for measurements but RO for gathering training data. A simple way to do this is to assign uniform gains after ordering (1 for the top item, 0 for the last, evenly spaced in between) and then train the ranker on them. This is not ideal, but it can be preferable when the data offers too few distinct levels or AGE labeling is too costly.
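The uniform-gain scheme just described, and the RO + AGE hybrid, can be sketched as follows (hypothetical function names; items are assumed to be ordered best-first):

```python
def uniform_gains(ordered_items):
    """Evenly spaced gains for an ordered list (best first):
    1.0 for the top item, 0.0 for the last."""
    n = len(ordered_items)
    if n == 1:
        return {ordered_items[0]: 1.0}
    return {item: 1.0 - i / (n - 1) for i, item in enumerate(ordered_items)}

def interpolated_gains(ordered_items, best_gain, worst_gain):
    """RO + AGE hybrid: AGE-label only the best and the worst item,
    then spread the remaining gains linearly between them."""
    n = len(ordered_items)
    if n == 1:
        return {ordered_items[0]: best_gain}
    step = (best_gain - worst_gain) / (n - 1)
    return {item: best_gain - i * step for i, item in enumerate(ordered_items)}
```

For three ordered items, `uniform_gains` yields 1.0, 0.5, 0.0, while `interpolated_gains` with AGE labels 0.9 and 0.1 for the endpoints yields 0.9, 0.5, 0.1.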


Implementations of Absolute Gain Estimation

There are 2 general methods to obtain non-binary ranking-like labels:


| | Predefined levels | Slider-like |
| --- | --- | --- |
| UX implementation | Radio buttons | Slider |
| Number of distinct levels | Few (e.g. the 4 named levels) | Many (e.g. a 1-100 scale) |
| Levels naming | Named, e.g. Perfect/Good/Fair/Bad; names should be clear and distinctive | Not named; any accompanying wording short, general, unbiased |
| Judges' requirements | | |
| Judgements per item | 2-3 per item | 5-10 per item |
| Inter-judge agreement measure | Krippendorff's alpha | Correlation coefficient |
| Speed of judgments | | |
| Stability (anti-variance) | | |
| For which tasks to use | Complex, formalizable, well-defined, homogeneous | Subjective, heterogeneous |
| Examples | TREC collections; most scientific papers in Information Retrieval (in most cases just binary: relevant/irrelevant) | Machine Translation in Bing: "we had people just score the translation on a scale, like, just a slider, really, and tell us how good the translation was. So, it was a very simple set of instructions that the evaluators got. And the reason we do that is so that we can get very consistent results and people can understand the instructions." |

Again, these methods can be combined, e.g. first solve the binary problem (Bad vs. Not bad) with the Predefined levels method, then solve the quality/attractiveness problem, which is usually more subjective, with the Slider-like method.

Also, the Slider-like method can be used as a preliminary step to construct a Predefined levels method:

  1. Ask multiple judges to rate items on the slider and then explain their choices.
  2. Infer levels and their definitions from these explanations (e.g. cluster and analyze them).
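A simpler, complementary way to bootstrap level boundaries (a sketch under my own assumption that only the slider scores, not the judges' explanations, are available) is to cut the score distribution into equal-frequency buckets:

```python
import bisect

def level_thresholds(scores, num_levels):
    """Equal-frequency cut points splitting slider scores into num_levels buckets."""
    s = sorted(scores)
    return [s[len(s) * i // num_levels] for i in range(1, num_levels)]

def assign_level(score, thresholds, level_names):
    """Map a slider score to a named level (level_names ordered worst to best)."""
    return level_names[bisect.bisect_left(thresholds, score)]
```

For instance, with 100 scores spread uniformly over 1-100 and the levels Bad/Fair/Good/Perfect, a score of 60 lands in Good. The level names and definitions would still come from analyzing the judges' explanations as described above.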

To sum up, it may be beneficial to first outline the most important questions the measurement should answer, then collect all task specifics – what type of items we rank, how hard it is to define the ideal item, etc. – and finally design the most suitable measurement strategy.