Ensemble techniques have received considerable attention in the recent machine learning literature (Bauer and Kohavi, 1999; Dietterich, 1997, 2000a; Opitz and Maclin, 1999). The idea of obtaining a diverse set of classifiers for a single learning problem and voting or averaging their predictions is both simple and powerful, and the resulting accuracy gains often have a sound theoretical foundation (Breiman, 1996; Freund and Schapire, 1997). Averaging the predictions of these classifiers helps to reduce the variance and often increases the reliability of the predictions. There are several techniques for obtaining a diverse set of classifiers. The most common is to use subsampling to diversify the training sets, as in bagging (Breiman, 1996) and boosting (Freund and Schapire, 1997). Other techniques include using different feature subsets (Bay, 1999), exploiting the randomness of the base algorithms (Kolen and Pollack, 1991), possibly by artificially randomizing their behavior (Dietterich, 2000b), or using multiple representations of the domain objects, for example information originating from different hyperlinks pointing to a web page (Fürnkranz, 2001a; Fürnkranz, 1999a). Finally, classifier diversity can be ensured by modifying the output labels, i.e., by transforming the learning task into a collection of related learning tasks that use the same input examples but different assignments of the class labels. Error-correcting output codes are the most prominent example of this type of ensemble method (Dietterich and Bakiri, 1995).
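To make the basic idea concrete, the following listing sketches ensemble construction by subsampling and prediction by unweighted voting; it is only an illustration (the learner interface learn and the bootstrap sampling are generic placeholders of bagging-style subsampling, not a description of any particular system):

    from collections import Counter
    import random

    def bagged_ensemble(train, learn, n_models=10, seed=0):
        """Train n_models classifiers, each on a bootstrap replicate of the
        training set (bagging-style subsampling)."""
        rng = random.Random(seed)
        models = []
        for _ in range(n_models):
            replicate = [rng.choice(train) for _ in train]
            models.append(learn(replicate))
        return models

    def vote(models, x):
        """Combine the ensemble by a simple majority vote over its predictions
        (each trained model is assumed to be a callable returning a class)."""
        return Counter(m(x) for m in models).most_common(1)[0][0]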
Clearly, round robin classification may also be interpreted as a member of this last group, and its performance gain may be seen in this context. Obviously, the final prediction is made by exploiting the redundancy provided by multiple models, each of which is constructed from a subset of the original data. However, contrary to subsampling approaches such as bagging and boosting, these datasets are constructed deterministically. In this respect pairwise classification shares more similarities with error-correcting output codes, but differs from them in its fixed procedure for setting up the new binary problems and in the fact that each of the new problems is smaller than the original problem. In particular, the latter fact may often cause the sub-problems in pairwise classification to be conceptually simpler than the original problem (as illustrated in Figure 1).
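For concreteness, the following sketch shows this pairwise decomposition; the binary learner learn_binary and the convention that a trained model is a callable returning one of its two classes are assumptions made for the illustration, not part of a specific implementation:

    from collections import Counter
    from itertools import combinations

    def round_robin_train(examples, labels, classes, learn_binary):
        """Train one binary classifier for each unordered pair of classes,
        using only the training examples of these two classes."""
        models = {}
        for a, b in combinations(classes, 2):
            pairs = [(x, y) for x, y in zip(examples, labels) if y in (a, b)]
            xs, ys = zip(*pairs)
            models[(a, b)] = learn_binary(xs, ys)
        return models

    def round_robin_predict(models, x):
        """Each of the c(c-1)/2 pairwise models casts one vote; the class
        receiving the most votes is predicted."""
        votes = Counter(model(x) for model in models.values())
        return votes.most_common(1)[0][0]

Each binary problem contains only the examples of its two classes, which is why the sub-problems are smaller (and often conceptually simpler) than the original multi-class problem.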
In previous work (Fürnkranz, 2001b), we observed that the improvements in accuracy obtained by R3 over RIPPER were quite similar to those obtained by C5.0-BOOST over C5.0 on the same problems. Round robin binarization seemed to work whenever boosting worked, and vice versa. The correlation coefficient of the error ratios of C5.0-BOOST/C5.0 and R3/RIPPER on the 20 datasets was about 0.618, which is in the same range as correlation coefficients for bagging and boosting (Opitz and Maclin, 1999). We interpreted this as weak evidence that the performance gains of round robin learning may be comparable to those of other ensemble methods and that it could be used as a general method for improving a learner's performance on multi-class problems. We investigate this question further in this section, focusing in particular on a comparison of round robin learning with boosting (Section 6.1) and bagging (Section 6.2), and on the potential of combining it with these techniques.
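The correlation above is computed between the two vectors of per-dataset error ratios; a minimal sketch of a Pearson correlation computation, assuming this is the statistic used (the 20 original ratio values are not reproduced here):

    from math import sqrt

    def pearson(xs, ys):
        """Pearson correlation coefficient of two equally long sequences."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    # xs would hold the per-dataset ratios error(R3)/error(RIPPER),
    # ys the corresponding ratios error(C5.0-BOOST)/error(C5.0).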
dataset | C5.0 error | round robin error | ratio | boosting error | ratio | both error | ratio |
abalone | 78.48 | 75.08 | 0.957 | 77.88 | 0.992 | 74.67 | 0.951 |
car | 7.58 | 5.84 | 0.771 | 3.82 | 0.504 | 1.85 | 0.244 |
glass | 35.05 | 24.77 | 0.707 | 27.57 | 0.787 | 22.90 | 0.653 |
image | 3.20 | 2.90 | 0.905 | 1.60 | 0.500 | 1.73 | 0.541 |
lr spectrometer | 51.22 | 51.79 | 1.011 | 46.70 | 0.912 | 51.98 | 1.015 |
optical | 9.20 | 5.04 | 0.547 | 2.46 | 0.267 | 2.54 | 0.277 |
page-blocks | 3.09 | 2.98 | 0.964 | 2.58 | 0.834 | 2.78 | 0.899 |
sat | 13.82 | 13.16 | 0.953 | 9.32 | 0.675 | 9.00 | 0.651 |
solar flares (c) | 15.77 | 15.69 | 0.995 | 16.41 | 1.041 | 16.70 | 1.059 |
solar flares (m) | 4.90 | 4.90 | 1.000 | 5.90 | 1.206 | 5.83 | 1.191 |
soybean | 9.66 | 6.73 | 0.697 | 6.59 | 0.682 | 6.44 | 0.667 |
thyroid (hyper) | 1.11 | 1.14 | 1.024 | 1.03 | 0.929 | 1.33 | 1.190 |
thyroid (hypo) | 0.58 | 0.69 | 1.182 | 0.32 | 0.545 | 0.53 | 0.909 |
thyroid (repl.) | 0.72 | 0.74 | 1.037 | 0.90 | 1.259 | 0.90 | 1.259 |
vehicle | 26.24 | 29.20 | 1.113 | 24.11 | 0.919 | 23.17 | 0.883 |
vowel | 21.72 | 19.49 | 0.898 | 8.89 | 0.409 | 14.75 | 0.679 |
yeast | 43.26 | 40.63 | 0.939 | 41.85 | 0.967 | 40.77 | 0.942 |
average | | | 0.909 | | 0.735 | | 0.757 |
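In the table, each ratio column divides the respective method's error rate by the C5.0 error rate on the same dataset, so values below 1 indicate an improvement over C5.0. The reported averages appear to be geometric means of the ratio columns (an assumption on our part, but one that is consistent with the listed values 0.909, 0.735, and 0.757). A minimal sketch:

    from math import exp, log

    def error_ratio(method_error, baseline_error):
        """Ratio < 1 means the method improves on the C5.0 baseline."""
        return method_error / baseline_error

    def geometric_mean(ratios):
        """Average the ratios on a multiplicative scale."""
        return exp(sum(log(r) for r in ratios) / len(ratios))

    # e.g. glass: 24.77 / 35.05 ≈ 0.707, matching the round robin ratio column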