In this section, we briefly present an experimental evaluation of round robin binarization in a rule learning context. We chose RIPPER (Cohen, 1995) as the base classifier, which, in our view, is the most advanced algorithm in the family of separate-and-conquer (or covering) rule learning algorithms (Fürnkranz, 1997; Fürnkranz, 1999b).
The unordered and ordered binarization procedures were used as implemented within RIPPER: ordered binarization is RIPPER's default mode, and unordered binarization can be invoked with the option -a unordered. Round robin binarization was implemented as a wrapper around RIPPER, which provides it with the appropriate training sets. The wrapper was implemented in Perl and communicates with RIPPER by writing the training sets to and reading RIPPER's results from disk. We refer to this implementation as R3 (round robin RIPPER).
In principle, RIPPER is a class-symmetric learner: it treats the larger class of a two-class problem as the default class and learns rules for the smaller class. Although this procedure is class-symmetric (problem <i,j> is converted to <j,i> if class i is larger than class j), we felt that it would not be fair. For example, the largest class in the multi-class problem would serve as the default class in all of its round robin problems, which may be an unfair advantage (or disadvantage) for this class.3 For this reason, R3 implements a double round robin binarization, which calls RIPPER with the option -a given on each binary problem <i,j> (i, j = 1...c, i ≠ j). This instructs RIPPER to use the classes in the order specified by the user (i.e., the order in which they appear in the names file). Hence, <i,j> and <j,i> are two different problems, which ensures that each class is the default class in exactly half of its binary classification problems. Note that this procedure is essentially identical to the one RIPPER employs in unordered mode on a two-class problem, except that RIPPER would immediately tie-break between the theories learned for <i,j> and <j,i>, whereas we first collect the votes from all c(c-1) binary problems.
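The double round robin scheme described above can be sketched as follows. The names here are illustrative, not from the paper: `train_binary` stands in for an invocation of the underlying binary learner (RIPPER with -a given in our setting), and the nearest-mean learner is purely a toy substitute so that the sketch is self-contained.

```python
from itertools import permutations

def nearest_mean(xs, ys):
    """Illustrative stand-in for the binary base learner (the paper
    uses RIPPER): classify by the nearer of the two class means."""
    pos = [x for x, y in zip(xs, ys) if y]
    neg = [x for x, y in zip(xs, ys) if not y]
    mp, mn = sum(pos) / len(pos), sum(neg) / len(neg)
    return lambda x: abs(x - mp) <= abs(x - mn)

def train_double_round_robin(examples, labels, train_binary):
    """Train one binary classifier for each ordered pair <i,j>,
    i != j, using only the examples of classes i and j."""
    classes = sorted(set(labels))
    models = {}
    for i, j in permutations(classes, 2):  # c*(c-1) ordered problems
        xs = [x for x, y in zip(examples, labels) if y in (i, j)]
        ys = [y == i for y in labels if y in (i, j)]
        models[(i, j)] = train_binary(xs, ys)
    return classes, models

def predict_double_round_robin(x, classes, models):
    """Collect the votes of all c*(c-1) binary classifiers and
    return the class with the most votes."""
    votes = dict.fromkeys(classes, 0)
    for (i, j), model in models.items():
        votes[i if model(x) else j] += 1
    return max(classes, key=votes.get)
```

On a toy one-dimensional problem with three classes, this trains 3 · 2 = 6 binary models, and each class appears as the "positive" (rule) class in exactly half of its pairings, mirroring the symmetry argument above.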
Table 2: Error rates of RIPPER (unordered and ordered binarization) and R3, the ratio of R3's error to each RIPPER variant, and the result of a McNemar test (++ = R3 significantly better at the 99% level, + = weakly significant, = = no significant difference). The first block shows cross-validation estimates; the second block shows error rates on the designated test sets.

dataset | RIPPER (unord.) | ratio | RIPPER (ordered) | ratio | R3 | McNemar
abalone | 81.64 | 0.911 | 81.18 | 0.916 | 74.34 | ++ |
car | 5.79 | 0.390 | 12.15 | 0.186 | 2.26 | ++ |
glass | 35.51 | 0.724 | 34.58 | 0.743 | 25.70 | ++ |
image | 4.15 | 0.823 | 4.29 | 0.808 | 3.46 | + |
lr spectrometer | 64.22 | 0.827 | 61.39 | 0.865 | 53.11 | ++ |
optical | 7.79 | 0.479 | 9.48 | 0.394 | 3.74 | ++ |
page-blocks | 2.85 | 0.968 | 3.38 | 0.816 | 2.76 | ++ |
sat | 13.18 | 0.785 | 13.04 | 0.794 | 10.35 | ++ |
solar flares (c) | 15.91 | 0.991 | 15.91 | 0.991 | 15.77 | = |
solar flares (m) | 4.90 | 1.029 | 5.47 | 0.921 | 5.04 | = |
soybean | 8.79 | 0.717 | 8.79 | 0.717 | 6.30 | ++ |
thyroid (hyper) | 1.25 | 0.893 | 1.49 | 0.749 | 1.11 | + |
thyroid (hypo) | 0.64 | 0.833 | 0.56 | 0.955 | 0.53 | = |
thyroid (repl.) | 1.17 | 0.863 | 0.98 | 1.026 | 1.01 | = |
vehicle | 28.25 | 1.029 | 30.38 | 0.957 | 29.08 | = |
vowel | 30.50 | 0.633 | 27.07 | 0.690 | 18.69 | ++ |
yeast | 44.00 | 0.949 | 42.39 | 0.986 | 41.78 | = |
average | | 0.787 | | 0.747 | |

abalone | 81.03 | 0.901 | 82.18 | 0.888 | 72.99 | ++ |
covertype | 35.37 | 0.939 | 38.50 | 0.862 | 33.20 | ++ |
letter | 15.22 | 0.516 | 15.75 | 0.498 | 7.85 | ++ |
sat | 14.25 | 0.782 | 17.05 | 0.654 | 11.15 | ++ |
shuttle | 0.03 | 0.667 | 0.06 | 0.375 | 0.02 | = |
vowel | 64.94 | 0.823 | 53.25 | 1.004 | 53.46 | = |
Table 1 shows the 20 datasets we used in this study. They were chosen arbitrarily among the datasets with at least four classes available at the UCI repository (Blake and Merz, 1998).4 The implementation of the algorithm was developed independently and not tuned on these datasets. On the six datasets with a dedicated test set, we report the error rate on that test set. On the other 14 datasets, we estimated the error rate using paired, stratified 10-fold cross-validations. For abalone, sat, and vowel, we performed both a test-set evaluation and a cross-validation.5
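The cross-validation design above has two ingredients: stratification (each fold mirrors the overall class distribution) and pairing (both learners see the identical folds, so their errors can be compared example by example). A minimal sketch of a stratified fold assignment, with a hypothetical function name of our choosing, might look as follows; in practice the examples of each class would be shuffled first, which we omit for brevity.

```python
from collections import defaultdict

def stratified_fold_assignment(labels, k=10):
    """Assign each example to one of k folds so that every fold
    mirrors the overall class distribution as closely as possible.
    Reusing one assignment for both learners makes the
    cross-validation 'paired'."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        # deal the members of each class over the folds in turn
        for pos, idx in enumerate(indices):
            fold_of[idx] = pos % k
    return fold_of
```

For example, with 20 examples of one class and 10 of another, every fold receives exactly two examples of the first class and one of the second.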
Table 2 shows the error rates of RIPPER (unordered and ordered) and R3 on the selected datasets. On half of the 20 experiments (not counting the cross-validated trials of the three datasets that re-appear at the bottom of the table), R3 is significantly better (at the 99% level on a McNemar test6) than RIPPER's default mode (ordered binarization). There were only two experiments (thyroid (repl.) and the test-set version of vowel) where R3 is worse than RIPPER, and both differences are insignificant. On the 17 cross-validated problems, R3 reduces the average error to about 75% of RIPPER's error.7 The comparison to unordered RIPPER is similar (the significance levels for this case are not shown).
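The McNemar test used here compares two classifiers on the same test examples by looking only at the discordant cases, where exactly one of the two errs. A minimal sketch of the chi-square version with the usual continuity correction (the function name is ours, not from the paper):

```python
def mcnemar_statistic(correct_a, correct_b):
    """Chi-square statistic (1 degree of freedom, with continuity
    correction) of the McNemar test on paired per-example outcomes
    (True = classified correctly). Values above 6.635 indicate a
    significant difference at the 99% level."""
    n01 = sum(1 for a, b in zip(correct_a, correct_b) if not a and b)
    n10 = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    if n01 + n10 == 0:
        return 0.0  # the two classifiers never disagree
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
```

Because only the disagreements enter the statistic, the test is well suited to paired evaluations such as ours, where both classifiers are run on identical test examples.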
We can safely conclude that round robin binarization may result in significant improvements over ordered or unordered binarization, without a high risk of decreasing performance.