In this section, we will briefly present an experimental evaluation of round robin binarization in a rule learning context. We chose RIPPER (Cohen, 1995) as the base classifier, which--in our view--is the most advanced algorithm of the family of separate-and-conquer (or covering) rule learning algorithms (Fürnkranz, 1997; Fürnkranz, 1999b).
The unordered and ordered binarization procedures were used as implemented within RIPPER. Ordered binarization is RIPPER's default mode, and unordered binarization can be invoked using the option -a unordered. The round robin binarization was implemented as a wrapper around RIPPER, which provided it with the appropriate training sets. The wrapper was implemented in perl and had to communicate with RIPPER by writing the training sets to and reading RIPPER's results from the disk. This implementation is referred to as R3 (round robin ripper).
12pt In principle, RIPPER is a class-symmetric
learner. It will treat the larger class of a two-class problem as the
default class and learn rules for the smaller class. Although this
procedure is class-symmetric (problem <i,j> is converted to
<j,i> if
), we felt that it would not be
fair. For example, the largest class in the multi-class problem would
be used as the default class in all round robin problems. This may be
an unfair advantage (or disadvantage) to this class.3 For this reason, R3 implements a double round
robin binarization which calls RIPPER with the option
-a given on each binary problem <i,j> (i,j= 1
...c, i
j). This
instructs RIPPER to use the classes in the order specified by the
user (i.e., the order in which they appear in the names file). Hence,
<i,j> and <j,i> are two different problems, which
ensures that each class is the default class in exactly half of its
binary classification problems. Note, that this procedure is basically
identical to the one that is employed by RIPPER if it is used in
unordered mode on a two-class problem except that RIPPER would
tie-break immediately between the theories learned for <i,j>
and <j,i>, while we first collect all votes from all
c(c-1) binary problems.
RIPPER | R3 | McNemar | ||||
dataset | unord. | ratio | ordered | ratio | test | |
abalone | 81.64 | 0.911 | 81.18 | 0.916 | 74.34 | ++ |
car | 5.79 | 0.390 | 12.15 | 0.186 | 2.26 | ++ |
glass | 35.51 | 0.724 | 34.58 | 0.743 | 25.70 | ++ |
image | 4.15 | 0.823 | 4.29 | 0.808 | 3.46 | + |
lr spectrometer | 64.22 | 0.827 | 61.39 | 0.865 | 53.11 | ++ |
optical | 7.79 | 0.479 | 9.48 | 0.394 | 3.74 | ++ |
page-blocks | 2.85 | 0.968 | 3.38 | 0.816 | 2.76 | ++ |
sat | 13.18 | 0.785 | 13.04 | 0.794 | 10.35 | ++ |
solar flares (c) | 15.91 | 0.991 | 15.91 | 0.991 | 15.77 | = |
solar flares (m) | 4.90 | 1.029 | 5.47 | 0.921 | 5.04 | = |
soybean | 8.79 | 0.717 | 8.79 | 0.717 | 6.30 | ++ |
thyroid (hyper) | 1.25 | 0.893 | 1.49 | 0.749 | 1.11 | + |
thyroid (hypo) | 0.64 | 0.833 | 0.56 | 0.955 | 0.53 | = |
thyroid (repl.) | 1.17 | 0.863 | 0.98 | 1.026 | 1.01 | = |
vehicle | 28.25 | 1.029 | 30.38 | 0.957 | 29.08 | = |
vowel | 30.50 | 0.633 | 27.07 | 0.690 | 18.69 | ++ |
yeast | 44.00 | 0.949 | 42.39 | 0.986 | 41.78 | = |
average | 0.787 | 0.747 | ||||
abalone | 81.03 | 0.901 | 82.18 | 0.888 | 72.99 | ++ |
covertype | 35.37 | 0.939 | 38.50 | 0.862 | 33.20 | ++ |
letter | 15.22 | 0.516 | 15.75 | 0.498 | 7.85 | ++ |
sat | 14.25 | 0.782 | 17.05 | 0.654 | 11.15 | ++ |
shuttle | 0.03 | 0.667 | 0.06 | 0.375 | 0.02 | = |
vowel | 64.94 | 0.823 | 53.25 | 1.004 | 53.46 | = |
Table 1 shows the 20 datasets we used in this study.
They were chosen arbitrarily among datasets with 4 classes
available at the UCI repository (Blake and Merz, 1998).4 The implementation of the algorithm was
developed independently and not tuned on these datasets. On the six
sets with a dedicated test set, we report the error rate on these test
sets. On the other 14 sets, we estimated the error rate using paired,
stratified 10-fold cross-validations. For abalone,
sat and vowel we performed both a test set
evaluation and a cross-validation.5
Table 2 shows the accuracies of RIPPER (unordered and ordered) and R3 on the selected datasets. On half of the 20 experiments (not counting the cross-validated trials of the three sets the re-appear at the bottom), R3 is significantly better (p > 0.99 on a McNemar test6) than RIPPER's default mode (ordered binarization). There were only two experiments (thyroid (repl.) and the test-set version of vowel), where R3 is worse than RIPPER, both differences being insignificant. On the 17 cross-validated problems, R3 reduces the average error to 75% of the error of RIPPER's error.7 The comparison to unordered RIPPER is similar (the significance levels for this case are not shown).
We can safely conclude that round robin binarization may result in significant improvements over ordered or unordered binarization without having a high risk of decreasing performance.