How to measure a ranking system: Three setups of ground truth data labeling

This article is contributed. See the original author and article here.

Imagine that we have a non-trivial subsystem (e.g. product retrieval for a user query) and need to know how good it is and whether it is the bottleneck of the overall system. Or we have 2 competing solutions and need to choose the better one.

It is generally accepted that to make such decisions, we need to measure the subsystem; to do the measurement, in turn, we need ground truth data, in most cases labeled by human judges (in-house or crowdsourced). Depending on our main goal and task specifics, one labeling setup may be better than another.

This post discusses ground truth labeling setups for ranking and helps choose the most appropriate one for a given use case. By ranking we mean the task of establishing an order on multiple items. It is common in many applications, such as search engines (for products, images, or webpages) and recommendation systems, where ranking is done by relevance to a user query or a user profile. Ranking can also be based on universal properties of the items to be ranked, e.g. ranking images by attractiveness.

Setups of ground truth data labeling

One can distinguish three main setups for ground truth data labeling:

Absolute Gain Estimation – each item is labeled independently of the others on an established absolute scale (usually from 0 to 1); note that in relevance-based ranking an item includes both constituents, e.g. a pair of a user query and an image to be ranked.
Relative ordering – the labeler sees all items and directly puts them in order.
Side-by-side – the labeler compares just 2 items at a time and provides a label reflecting the result of the comparison; the label may be binary or multivalued.

For example, if all we need is to compare 2 subsystems and it is relatively easy to tell which of 2 items is better, then Side-by-side is the way to go. If instead we want to know how far the current ranking subsystem is from the ideal one, Absolute Gain Estimation may be a better choice.
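As a minimal sketch of how Side-by-side labels could be turned into a comparison result, the Python snippet below computes the net win rate (wins-losses)/(wins+losses+ties) of one subsystem over another. It also runs an exact two-sided sign test on the non-tied pairs, used here as a simple stand-in for a significance test. The label strings "win", "loss" and "tie" and all function names are assumptions made for this illustration.

```python
from collections import Counter
from math import comb


def sbs_counts(labels):
    """Count wins, losses and ties of system A against system B."""
    counts = Counter(labels)
    return counts["win"], counts["loss"], counts["tie"]


def net_win_rate(labels):
    """(wins - losses) / (wins + losses + ties), in the range [-1, 1]."""
    wins, losses, ties = sbs_counts(labels)
    total = wins + losses + ties
    if total == 0:
        raise ValueError("no side-by-side labels given")
    return (wins - losses) / total


def sign_test_p_value(labels):
    """Exact two-sided sign test on non-tied pairs (null: wins and losses equally likely)."""
    wins, losses, _ = sbs_counts(labels)
    n = wins + losses
    if n == 0:
        return 1.0  # only ties: no evidence either way
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical labels: one per query, from the point of view of system A.
labels = ["win", "win", "loss", "tie", "win", "tie", "loss", "win"]
rate = net_win_rate(labels)
p_value = sign_test_p_value(labels)
print(f"net win rate: {rate:.2f}, sign test p-value: {p_value:.4f}")
```

With only 8 judged queries the difference is far from significant, which illustrates why the number of labeled comparisons matters as much as the win rate itself.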

See a detailed comparison of these setups in the following table:

Setup	Absolute Gain Estimation (AGE)	Relative ordering (RO)	Side-by-Side (SBS)
Description	Given a single item (per query), return a number reflecting the quality of this item	Introduce an order on a collection of items (per query)	Compare 2 items (per query) and decide which one is better (possibly with a scale: how much better)
Where it can be used (tasks)	1. Estimate how far we are from the ideal 2. Estimate the priority of the task 3. Estimate ROI 4. Estimate not only the quality of the ranking subsystem, but also the quality of the items to be ranked (say, if all items have scores around 0.1, then even perfect ranking wouldn’t improve overall quality and better items need to be added to the system) 5. Training data collection 6. Compare any number of systems	1. Training data collection 2. Compare any number of systems	1. Compare Prod vs New (to decide whether to ship the new model) 2. Compare the system with a competitor 3. Train gains for levels (by maximum likelihood estimation)
How it can be implemented	(See details in the section below) 1. Predefined levels (Excellent/Good/Fair/Bad) 2. Slider-like	1. Best-worst scaling 2. Direct swapping of items until the needed order is achieved	Show two items and choose one of the predefined levels (e.g. on a Likert scale or on a 1-100 slider)
Possible metrics	(n)DCG; MAP and MRR (for binary labels)	Rank correlation coefficients (e.g. Kendall’s tau)	Win/loss ratios (possibly weighted), e.g. (wins-losses)/(wins+losses+ties); McNemar’s test
Main pros	Universal; easy to combine with other measurements	Most similar to actual ranking, thus best for the corresponding tasks (low overhead, high speed, etc.)	Most similar to actual comparison, thus best for the corresponding tasks (high sensitivity, low overhead, high speed, etc.); can include a scale (e.g. Likert)
Main cons	Hard to define and judge: one needs to describe or imagine the Ideal and Worst items for each query; as a result, worse sensitivity for Training and Comparison	Can’t differentiate between 2-item scenarios with different gains, e.g. 1 and 0.1 vs 0.6 and 0.5; can hardly support labeling of a large number of items (say, more than 10)	By design, requires 2 items to compare.
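To make two of the metrics in the table concrete, here is a small Python sketch of nDCG computed from AGE labels and of Kendall's tau computed between a system ordering and a labeler's Relative ordering. The log2 position discount is one common choice rather than the only one, the tau variant assumes no ties, and the function names and sample data are assumptions made for this illustration.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with the common log2 position discount."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))


def ndcg(gains_in_ranked_order):
    """DCG of the system's ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(gains_in_ranked_order, reverse=True))
    return dcg(gains_in_ranked_order) / ideal if ideal > 0 else 0.0


def kendall_tau(order_a, order_b):
    """Kendall's tau between two orderings of the same items (no ties)."""
    if set(order_a) != set(order_b) or len(order_a) < 2:
        raise ValueError("need two orderings of the same 2+ items")
    pos_b = {item: i for i, item in enumerate(order_b)}
    n = len(order_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # order_a puts order_a[i] before order_a[j]; does order_b agree?
            if pos_b[order_a[i]] < pos_b[order_a[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# AGE: absolute gains of the items, listed in the order the system ranked them.
system_gains = [0.9, 0.1, 0.5]
score = ndcg(system_gains)

# RO: the system's order vs. the order given by a labeler.
tau = kendall_tau(["a", "b", "c", "d"], ["a", "c", "b", "d"])
print(f"nDCG: {score:.3f}, Kendall tau: {tau:.3f}")
```

Note that nDCG needs the absolute gains themselves, while Kendall's tau only needs the two orderings, which mirrors the difference between the AGE and RO setups.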

Hybrids of these setups are also possible, e.g.

RO + AGE: first do Relative ordering, then assign a scale to all items by doing the costly AGE only for the best and the worst items of the query.
SBS + AGE: do Side-by-side against items with known absolute gains (e.g. 0.9, 0.5, 0.1) from another query (harder to compare, especially in medium cases).

Note that we can use different setups for different tasks, e.g. AGE for measurements but RO for gathering training data. A simple way to do the latter is to assign uniformly spaced gains after ordering (1 for the top item, 0 for the last one, evenly spaced in between) and then train the ranker on them. This is not ideal, but can be better when the data has too few levels or AGE labeling is too costly.
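The conversion from an ordering to gains can be sketched in Python as follows. With the default endpoints 1 and 0 it produces the uniform gains described above; passing measured gains for the best and worst items gives one possible reading of the RO + AGE hybrid. The evenly spaced interpolation, the function name and the item identifiers are assumptions made for this illustration.

```python
def gains_from_order(ordered_items, best_gain=1.0, worst_gain=0.0):
    """Assign evenly spaced gains to items listed from best to worst."""
    n = len(ordered_items)
    if n == 0:
        return {}
    if n == 1:
        return {ordered_items[0]: best_gain}
    step = (best_gain - worst_gain) / (n - 1)
    return {item: best_gain - i * step for i, item in enumerate(ordered_items)}


# Relative ordering for one query, best item first (hypothetical ids).
ordering = ["img_7", "img_2", "img_9", "img_4", "img_1"]

# Uniform gains for training: 1 for the top item, 0 for the last one.
uniform = gains_from_order(ordering)

# RO + AGE: only the best and worst items received absolute labels.
anchored = gains_from_order(ordering, best_gain=0.8, worst_gain=0.2)

print(uniform)
print(anchored)
```

The anchored variant keeps the ordering from the labeler but no longer pretends that every query has a perfect top item and a worthless bottom item.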

Implementations of Absolute Gain Estimation

There are 2 general methods to obtain non-binary ranking-like labels:

Method	Predefined levels	Slider-like
UX Implementation	Radio buttons	Slider
Number of distinct levels	2-5	10-100
Levels naming	Named, e.g. Perfect/Good/Fair/Bad	Not named
Guidelines	Clear, distinctive	Short, general, unbiased
Judges’ requirements	High	Medium
Judgements per item	2-3 per item	5-10 per item
Agreement	Krippendorff’s alpha	Correlation coefficient
Speed of judgments	Low	Medium
Interpretability	Medium	Low
Flexibility	Medium	High
Stability (anti-variance)	Medium	Low
For which tasks to use	Complex, formalizable, well-defined, homogeneous	Subjective, heterogeneous
Examples	TREC collections; most scientific papers in information retrieval (in most cases, just binary: relevant/irrelevant)	Machine Translation in Bing: “we had people just score the translation on a scale, like, just a slider, really, and tell us how good the translation was. So, it was a very simple set of instructions that the evaluators got. And the reason we do that is so that we can get very consistent results and people can understand the instructions.”
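For the Slider-like method, the table suggests several judgments per item and a correlation coefficient as the agreement measure. A minimal Python sketch of both steps is shown below: item scores are averaged over judges, and agreement is estimated as the mean pairwise Pearson correlation between judges. The choice of Pearson correlation and of the plain mean, the judge names and the scores are all assumptions made for this illustration.

```python
import math
from itertools import combinations
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def mean_pairwise_correlation(scores_by_judge):
    """Average Pearson correlation over all pairs of judges."""
    pairs = combinations(scores_by_judge.values(), 2)
    return mean(pearson(a, b) for a, b in pairs)


def item_scores(scores_by_judge):
    """Aggregate slider judgments: mean score per item across judges."""
    return [mean(col) for col in zip(*scores_by_judge.values())]


# Hypothetical 0-100 slider scores: each judge scored the same 4 items.
scores_by_judge = {
    "judge_1": [80, 20, 55, 90],
    "judge_2": [70, 30, 50, 95],
    "judge_3": [85, 10, 60, 75],
}
agreement = mean_pairwise_correlation(scores_by_judge)
aggregated = item_scores(scores_by_judge)
print(f"agreement: {agreement:.3f}, item scores: {aggregated}")
```

Averaging over several judges is what compensates for the lower stability of individual slider judgments.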

Again, combinations are possible, e.g. first solve the binary problem (Bad vs Not bad) with the Predefined levels method, then solve the quality/attractiveness problem, which is usually more subjective, with the Slider-like method.

Also, the Slider-like method can be used as a preliminary step for constructing the Predefined levels method:

Ask multiple judges to rank items and then to explain their choices.
Infer levels and their definitions from these explanations (e.g. cluster them and analyze).

To sum up, it may be beneficial to first outline the most important questions the measurement should answer, then collect all task specifics (what the type of items is, how hard it is to define an ideal item, etc.) and finally design the most suitable measurement strategy.
