A Bayesian Bradley-Terry model to compare multiple ML algorithms on multiple data sets

Jacques Wainer

This paper presents a Bayesian model, the Bayesian Bradley-Terry (BBT) model, for comparing multiple algorithms on multiple data sets using any metric. The model extends the Bradley-Terry model and is based on the number of wins each algorithm has over the others across the data sets. Unlike frequentist methods such as Demšar's tests on mean ranks or multiple pairwise Wilcoxon tests, the Bayesian approach supports more nuanced statements about the algorithms’ relative performance and allows the definition of a “region of practical equivalence” (ROPE) for two algorithms. The paper also introduces the concept of a “local ROPE,” which uses effect sizes to assess whether the difference in mean measure between two algorithms is significant, and which can be applied in frequentist approaches as well. An R package and a Python program implementing the BBT are available.
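To make the notion of "wins" concrete, the following is a minimal sketch, not the paper's implementation. It counts pairwise wins and ties from a table of scores and evaluates the standard Bradley-Terry win probability for two ability parameters. The function names, the toy scores, and the use of a fixed threshold on the raw metric difference as the ROPE are all assumptions made for illustration. The local ROPE described above is based on effect sizes and is not reproduced here, and in the full Bayesian model the abilities are estimated from the data rather than supplied by hand.

```python
import math
from itertools import combinations


def count_wins(scores, rope=0.0):
    """Count pairwise wins and ties across data sets.

    scores: dict mapping algorithm name -> list of metric values,
            one per data set, where higher is better (an assumption).
    rope:   hypothetical fixed threshold; differences whose absolute
            value is at most `rope` are counted as ties.
    """
    names = list(scores)
    wins = {(a, b): 0 for a in names for b in names if a != b}
    ties = {(a, b): 0 for a, b in combinations(names, 2)}
    for a, b in combinations(names, 2):
        for score_a, score_b in zip(scores[a], scores[b]):
            diff = score_a - score_b
            if abs(diff) <= rope:
                ties[(a, b)] += 1
            elif diff > 0:
                wins[(a, b)] += 1
            else:
                wins[(b, a)] += 1
    return wins, ties


def bt_win_probability(beta_a, beta_b):
    """Bradley-Terry probability that a beats b, given log-abilities."""
    return 1.0 / (1.0 + math.exp(-(beta_a - beta_b)))


# Toy accuracy values on three data sets (made up for illustration).
toy_scores = {
    "rf": [0.90, 0.80, 0.75],
    "svm": [0.85, 0.82, 0.75],
    "knn": [0.70, 0.78, 0.60],
}
wins, ties = count_wins(toy_scores, rope=0.01)
```

In this toy table, `wins[("rf", "knn")]` is 3 and `ties[("rf", "svm")]` is 1, and two algorithms with equal abilities get a win probability of 0.5. A Bayesian treatment would place priors on the abilities and report posterior probabilities instead of point values.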

Abstract