In this section, we briefly present an experimental evaluation of round robin binarization in a rule learning context. We chose RIPPER (Cohen, 1995) as the base classifier, which, in our view, is the most advanced algorithm in the family of separate-and-conquer (or covering) rule learning algorithms (Fürnkranz, 1997; Fürnkranz, 1999b).
The unordered and ordered binarization procedures were used as implemented within RIPPER: ordered binarization is RIPPER's default mode, and unordered binarization can be invoked with the option -a unordered. Round robin binarization was implemented as a wrapper around RIPPER, which provides it with the appropriate training sets. The wrapper was implemented in Perl and communicates with RIPPER by writing the training sets to and reading RIPPER's results from disk. We refer to this implementation as R3 (round robin RIPPER).
In principle, RIPPER is a class-symmetric learner: it treats the larger class of a two-class problem as the default class and learns rules for the smaller class. Although this procedure is class-symmetric (problem <i,j> is converted to <j,i> if class i is larger than class j), we felt that it would not be fair. For example, the largest class in the multi-class problem would serve as the default class in all of its round robin problems, which may be an unfair advantage (or disadvantage) for this class.3 For this reason, R3 implements a double round robin binarization, which calls RIPPER with the option -a given on each binary problem <i,j> (i, j = 1...c, i ≠ j). This instructs RIPPER to use the classes in the order specified by the user (i.e., the order in which they appear in the names file). Hence, <i,j> and <j,i> are two different problems, which ensures that each class is the default class in exactly half of its binary classification problems. Note that this procedure is essentially identical to the one RIPPER employs in unordered mode on a two-class problem, except that RIPPER would immediately tie-break between the theories learned for <i,j> and <j,i>, whereas we first collect the votes from all c(c-1) binary problems.
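The double round robin scheme described above can be sketched as follows. The names here are illustrative, not from the paper: `train_binary` stands in for an invocation of the underlying binary learner (RIPPER with -a given in our setting), and the nearest-mean learner is purely a toy substitute so that the sketch is self-contained.

```python
from itertools import permutations

def nearest_mean(xs, ys):
    """Illustrative stand-in for the binary base learner (the paper
    uses RIPPER): classify by the nearer of the two class means."""
    pos = [x for x, y in zip(xs, ys) if y]
    neg = [x for x, y in zip(xs, ys) if not y]
    mp, mn = sum(pos) / len(pos), sum(neg) / len(neg)
    return lambda x: abs(x - mp) <= abs(x - mn)

def train_double_round_robin(examples, labels, train_binary):
    """Train one binary classifier for each ordered pair <i,j>,
    i != j, using only the examples of classes i and j."""
    classes = sorted(set(labels))
    models = {}
    for i, j in permutations(classes, 2):  # c*(c-1) ordered problems
        xs = [x for x, y in zip(examples, labels) if y in (i, j)]
        ys = [y == i for y in labels if y in (i, j)]
        models[(i, j)] = train_binary(xs, ys)
    return classes, models

def predict_double_round_robin(x, classes, models):
    """Collect the votes of all c*(c-1) binary classifiers and
    return the class with the most votes."""
    votes = dict.fromkeys(classes, 0)
    for (i, j), model in models.items():
        votes[i if model(x) else j] += 1
    return max(classes, key=votes.get)
```

On a toy one-dimensional problem with three classes, this trains 3 · 2 = 6 binary models, and each class appears as the "positive" (rule) class in exactly half of its pairings, mirroring the symmetry argument above.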
Table 2: Error rates of RIPPER (unordered and ordered binarization) and R3, the ratio of R3's error to each RIPPER variant, and the result of a McNemar test (++ = R3 significantly better at the 99% level, + = weakly significant, = = no significant difference). The first block shows cross-validation estimates; the second block shows error rates on the designated test sets.

dataset | RIPPER (unord.) | ratio | RIPPER (ordered) | ratio | R3 | McNemar
abalone | 81.64 | 0.911 | 81.18 | 0.916 | 74.34 | ++ |
car | 5.79 | 0.390 | 12.15 | 0.186 | 2.26 | ++ |
glass | 35.51 | 0.724 | 34.58 | 0.743 | 25.70 | ++ |
image | 4.15 | 0.823 | 4.29 | 0.808 | 3.46 | + |
lr spectrometer | 64.22 | 0.827 | 61.39 | 0.865 | 53.11 | ++ |
optical | 7.79 | 0.479 | 9.48 | 0.394 | 3.74 | ++ |
page-blocks | 2.85 | 0.968 | 3.38 | 0.816 | 2.76 | ++ |
sat | 13.18 | 0.785 | 13.04 | 0.794 | 10.35 | ++ |
solar flares (c) | 15.91 | 0.991 | 15.91 | 0.991 | 15.77 | = |
solar flares (m) | 4.90 | 1.029 | 5.47 | 0.921 | 5.04 | = |
soybean | 8.79 | 0.717 | 8.79 | 0.717 | 6.30 | ++ |
thyroid (hyper) | 1.25 | 0.893 | 1.49 | 0.749 | 1.11 | + |
thyroid (hypo) | 0.64 | 0.833 | 0.56 | 0.955 | 0.53 | = |
thyroid (repl.) | 1.17 | 0.863 | 0.98 | 1.026 | 1.01 | = |
vehicle | 28.25 | 1.029 | 30.38 | 0.957 | 29.08 | = |
vowel | 30.50 | 0.633 | 27.07 | 0.690 | 18.69 | ++ |
yeast | 44.00 | 0.949 | 42.39 | 0.986 | 41.78 | = |
average | | 0.787 | | 0.747 | |

abalone | 81.03 | 0.901 | 82.18 | 0.888 | 72.99 | ++ |
covertype | 35.37 | 0.939 | 38.50 | 0.862 | 33.20 | ++ |
letter | 15.22 | 0.516 | 15.75 | 0.498 | 7.85 | ++ |
sat | 14.25 | 0.782 | 17.05 | 0.654 | 11.15 | ++ |
shuttle | 0.03 | 0.667 | 0.06 | 0.375 | 0.02 | = |
vowel | 64.94 | 0.823 | 53.25 | 1.004 | 53.46 | = |
Table 1 shows the 20 datasets we used in this study. They were chosen arbitrarily among the datasets with at least four classes available at the UCI repository (Blake and Merz, 1998).4 The implementation of the algorithm was developed independently and not tuned on these datasets. On the six datasets with a dedicated test set, we report the error rate on that test set. On the other 14 datasets, we estimated the error rate using paired, stratified 10-fold cross-validations. For abalone, sat, and vowel, we performed both a test-set evaluation and a cross-validation.5
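The cross-validation design above has two ingredients: stratification (each fold mirrors the overall class distribution) and pairing (both learners see the identical folds, so their errors can be compared example by example). A minimal sketch of a stratified fold assignment, with a hypothetical function name of our choosing, might look as follows; in practice the examples of each class would be shuffled first, which we omit for brevity.

```python
from collections import defaultdict

def stratified_fold_assignment(labels, k=10):
    """Assign each example to one of k folds so that every fold
    mirrors the overall class distribution as closely as possible.
    Reusing one assignment for both learners makes the
    cross-validation 'paired'."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        # deal the members of each class over the folds in turn
        for pos, idx in enumerate(indices):
            fold_of[idx] = pos % k
    return fold_of
```

For example, with 20 examples of one class and 10 of another, every fold receives exactly two examples of the first class and one of the second.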
Table 2 shows the error rates of RIPPER (unordered and ordered) and R3 on the selected datasets. On half of the 20 experiments (not counting the cross-validated trials of the three datasets that re-appear at the bottom of the table), R3 is significantly better (at the 99% level on a McNemar test6) than RIPPER's default mode (ordered binarization). There were only two experiments (thyroid (repl.) and the test-set version of vowel) where R3 is worse than RIPPER, and both differences are insignificant. On the 17 cross-validated problems, R3 reduces the average error to about 75% of RIPPER's error.7 The comparison to unordered RIPPER is similar (the significance levels for this case are not shown).
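The McNemar test used here compares two classifiers on the same test examples by looking only at the discordant cases, where exactly one of the two errs. A minimal sketch of the chi-square version with the usual continuity correction (the function name is ours, not from the paper):

```python
def mcnemar_statistic(correct_a, correct_b):
    """Chi-square statistic (1 degree of freedom, with continuity
    correction) of the McNemar test on paired per-example outcomes
    (True = classified correctly). Values above 6.635 indicate a
    significant difference at the 99% level."""
    n01 = sum(1 for a, b in zip(correct_a, correct_b) if not a and b)
    n10 = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    if n01 + n10 == 0:
        return 0.0  # the two classifiers never disagree
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
```

Because only the disagreements enter the statistic, the test is well suited to paired evaluations such as ours, where both classifiers are run on identical test examples.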
We can safely conclude that round robin binarization may result in significant improvements over ordered or unordered binarization, without a high risk of decreasing performance.