Footnotes

... binarization.¹

The etymology of this phrase is interesting and unclear. According to The Shorter Oxford English Dictionary (1973), the phrase has at one time or another referred to the fish decapterus punctatus, to the Holy Sacrament (in a blasphemous way), and to written complaints that were signed in a circular way to disguise the order in which the subscribers have signed the petition. A posting on the usenet group uk.culture.language.english (thanks to Foster Provost for this information) claims that its current use in games and sports goes back to these circular signatures, and that the phrase originates from the French word ruban, ribbon. Finally--on a Web-site that is dedicated to analyzing lyrics from the Grateful Dead--Psychology Professor Tom Malloy (Rhode Island College) points out that the use of round robin in the song The Wheel might also refer to a research design used in biology and psychology where pairwise reactions of a group of subjects to each other's behaviors (e.g., smiling) are measured.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

...<j,i>.²

Another approach to tackle this problem could be to ensure that each class is used as the default class in approximately half of the classification problems.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

... class.³

The situation can be compared to a chess player that has to play all of his games with the same color or a football team that is allowed to play all (or none) of its games in the home stadium. This can make a decisive difference and may invalidate the final result of the tournament.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

...UCI.⁴

The restriction to 4 or more classes was made because on 3-class problems, we would expect frequent 3-way ties, which are not yet handled very cleverly. The issue of ties is discussed further below in the paper (Section 7).

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

... cross-validation.⁵

As we can see in Table 2, there are no qualitative differences between hold-out and cross-validation for abalone and sat. For vowel the performance of all algorithms was significantly better in the case of cross-validation, which may indicate that the 528 examples in the original training set are insufficient for learning a good concept.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

... test ⁶

The McNemar test (McNemar, 1947) tests for significant differences of proportions in paired sample designs. It is appropriate for comparing classifiers because it does not assume independent samples and has a comparably low Type I error (Feelders and Verkooijen, 1995; Dietterich, 1998).

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

... error.⁷

As these are relative performance measures, we use a geometric average so that x and 1/x average to 1.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

...n.⁸

This is a very strong assumption that will in general not hold up in practice. Strictly speaking, the time needed for training a classifier does not only depend on the sample size, but also on the domain and the ``difficulty'' of the sample for the learner. Even in the same domain, samples of equal sizes may yield theories of different complexities, and more complex theories will in general require longer training times. One way to incorporate this aspect could be to interpret our complexity functions as average run-times over all possible subsets of this size (in a given domain). However, we believe that making the simplifying assumption of an exact complexity function does not invalidate our claims, which are also backed up by empirical results (Section 5.2).

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

... possible.⁹

A simple algorithm for constructing such a tournament is to fix one player, say the one with index c (assume c is even, add a dummy player if it is not). Then construct a tournament schedule for the first round by pairing player i with player (c+1)-i, i = 1 ...c/2. Subsequent rounds are played with the same pairings, but before each round, the first c-1 entries rotate places, i.e., player i takes over the place of i+1 ( i = 1...c-2) and c-1 moves to 1. It is both, fairly trivial to see that this is correct, and quite easy to put into practice when organizing a round robin chess tournament for kids (let one stick to its seat and all others move around one seat at a time).

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

... tricky.¹⁰

The straightforward proof idea of deriving an upper bound for all regular pairings of each round (excluding the pairing with the dummy class) does not go through because the inequality $\sum_{i=1}^c (n-n_i)^p < (c-1) n^p$ does not hold in general.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

... run-time ¹¹

The complexity of RIPPER's initial rule learning and pruning phase is $O(n\log^2(n))$ (Cohen, 1995; Fürnkranz and Widmer, 1994). Thereafter RIPPER performs two phases of optimization, which--according to the experimental evidence shown in (Cohen, 1995)--only add a constant factor to the overall complexity.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

... times,¹²

Note that the performance ratios for testing times would in general be considerably worse; in some cases, we observed increases in the factors of up to 50%. We will briefly discuss the problem of classification efficiency (and some proposals for solving it) in Section 7.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

... efficient.¹³

The effect is somewhat compensated by the fact that we only report user time (which ignores time for disk access and system time). For example, on the 26-class letter dataset, where R3 writes 26 * 25 = 650 files to the disk, its total run-time is about 15% higher than the reported user time, while there is almost no difference for RIPPER.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

... deterministically.¹⁴

Boosting is also deterministic if the base learner is able to use weighted examples. Often, however, the example weights are interpreted as probabilities which are used for drawing a random sample as the training set for the next boosting iteration.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

... away.¹⁵

The aspect of being able to rank the available classifications for each example (as an intermediate version between predicting only a class value and providing a full probability distribution) is another interesting aspect of round robin binarization, which might be worth exploring in more depth.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.