In this section, we will discuss a more complex class binarization procedure, the pairwise classifier. The basic idea is quite simple, namely to learn one classifier for each pair of classes. In analogy to round robin tournaments in sports and games, in which each participant is paired with each other participant, we call this procedure round robin binarization.1
Round robin binarization is illustrated in
Figure 1. For the 6-class problem shown in
Figure 1(a), the round robin procedure learns 15
classifiers, one for each pair of classes.
Figure 1(c) shows the classifier <+,>. In
comparison, Figure 1(b) shows one of the
classifiers for the unordered class binarization, namely the one that
pairs class + against all other classes. It is obvious that in the
round robin case, the base classifier uses fewer examples and thus has
more freedom for fitting a decision boundary between the two classes.
In fact, in our example, all binary classification problems of the
round robin binarization could be solved with a simple linear
discriminant, while neither the multi-class problem nor its unordered
binarization have a linear solution. The phenomenon that pairwise
decision boundaries can be considerably simpler than those originating
from unordered binarization has also been observed in real-world
domains. For example, Knerr et al. (1992) observed
that the classes of a digit recognition task were pairwise linearly
separable, while the corresponding one-against-all task was not
amenable to single-layer networks. The results of
Hsu and Lin (2002), who obtained a larger advantage of
round robin binarization over unordered binarization for support
vector machines with a linear kernel than for those with a non-linear
kernel on several benchmark problems, could also be
explained by simpler decision boundaries in the round robin case.
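The pairwise decomposition itself can be sketched in a few lines. The function name and data representation below are illustrative, not taken from the paper:

```python
from itertools import combinations

def round_robin_tasks(examples, labels):
    """Split a c-class problem into one binary task per unordered
    class pair -- c*(c-1)/2 tasks in total (15 for c = 6)."""
    classes = sorted(set(labels))
    tasks = {}
    for ci, cj in combinations(classes, 2):
        # Each binary task only sees the examples of its two classes,
        # which is why its decision boundary can be much simpler.
        tasks[(ci, cj)] = [(x, y) for x, y in zip(examples, labels)
                           if y in (ci, cj)]
    return tasks

# A toy 6-class problem yields 15 pairwise tasks:
toy_labels = [0, 1, 2, 3, 4, 5] * 4
tasks = round_robin_tasks(list(range(24)), toy_labels)
print(len(tasks))  # -> 15
```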
A crucial point, of course, is how to decode the predictions of the
pairwise classifiers into a final prediction. We implemented a simple
voting technique: when classifying a new example, each of the learned
base classifiers determines to which of its two classes the example is
more likely to belong. The winner is assigned a point, and in the
end, the algorithm will predict the class that has accumulated the
most points. We break ties by preferring the class that is more
frequent in the training set (or flipping a coin if the frequencies
are equal).
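A minimal sketch of this voting scheme (the names are illustrative; a deterministic tie-break by training-set frequency stands in here, with an arbitrary choice among equally frequent classes playing the role of the coin flip):

```python
from collections import Counter

def vote(predictions, train_freq):
    """predictions: one predicted class label per pairwise classifier.
    train_freq: mapping class -> frequency in the training set."""
    points = Counter(predictions)          # one point per won match
    best = max(points.values())
    tied = [c for c, p in points.items() if p == best]
    # Break ties in favour of the class that is more frequent in the
    # training set; among equally frequent classes the choice is arbitrary.
    return max(tied, key=lambda c: train_freq.get(c, 0))

print(vote(["a", "a", "b"], {"a": 10, "b": 20}))  # -> a
print(vote(["a", "b"], {"a": 1, "b": 5}))         # tie, broken by frequency -> b
```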
Note that some examples are necessarily misclassified by some of the
binary base classifiers, because each classifier must label every
example as belonging to one of the two classes it was trained on.
Consider the classifier shown in Figure 1(c): it
will arbitrarily assign all examples of class o to one of its own two
classes (depending on which side of the decision boundary they lie).
In principle, such ``unqualified'' votes may lead to an incorrect
final classification. However, the votes of the five classifiers that
contain examples of class o should be able to overrule the
votes of the other ten classifiers, which pick one of their two
constituent classes for each o example. If the class values
are independent, it is unlikely that all classifiers would unanimously
vote for a wrong class. However, the likelihood of such a situation
could increase if there is some similarity between the correct class
and some other class value (e.g., in problems with a hierarchical
class structure).
In any case, if the five o
classifiers unanimously vote for o, no other class can
accumulate five votes (because they lost their direct match against
o). Nevertheless, this simple voting procedure is certainly
suboptimal. We will discuss alternative decoding techniques
in Section 7.
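This counting argument is easy to verify mechanically. In the following sketch (class names and the worst-case assignment are illustrative), class "a" wins all five of its matches, and a single rival is given every remaining match it takes part in:

```python
from itertools import combinations

# 6-class round robin: if "a" wins all 5 of its matches, no rival can
# reach 5 votes, since every rival already lost its direct match to "a".
classes = list("abcdef")
target, rival = "a", "b"      # rival wins every match not involving "a"
votes = {c: 0 for c in classes}
for ci, cj in combinations(classes, 2):
    if target in (ci, cj):
        winner = target
    elif rival in (ci, cj):
        winner = rival        # most favourable outcome for the rival
    else:
        winner = ci
    votes[winner] += 1
print(votes[target], votes[rival])  # -> 5 4
```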
In the above definition, we assume that the problem of discriminating class i from class j is identical to the problem of discriminating class j from class i. This is the case if the base learner is class-symmetric. Rule learning algorithms, however, need not be class-symmetric. Many of them choose one of the two classes as the default class, and learn only rules to cover the other class. In such a case, <i,j> and <j,i> may be two different classification problems, if j is used as the default class in the former, and i in the latter. A straightforward approach for addressing this problem is to play a so-called double round robin, in which separate classifiers are learned for both problems, <i,j> and <j,i>.2
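The bookkeeping difference between the two variants can be sketched with ordered versus unordered pairs (the 4-class problem below is illustrative):

```python
from itertools import combinations, permutations

# With a class-symmetric learner, one classifier per unordered pair
# suffices; an asymmetric learner (e.g. one that covers the first class
# and uses the second as default) needs both orderings -- the double
# round robin, with twice as many binary tasks.
classes = ["a", "b", "c", "d"]
single = list(combinations(classes, 2))   # <i,j> only: c*(c-1)/2 tasks
double = list(permutations(classes, 2))   # <i,j> and <j,i>: c*(c-1) tasks
print(len(single), len(double))  # -> 6 12
```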
In the following section, we will evaluate a double round robin as an alternative binarization strategy for the rule learner RIPPER.
name | train | test | sym. attrs | num. attrs | classes | def. error (%)
abalone | 3133 | 1044 | 1 | 7 | 29 | 83.5 |
car | 1728 | -- | 6 | 0 | 4 | 30.0 |
covertype | 15,120 | 565,892 | 44 | 10 | 7 | 51.1 |
glass | 214 | -- | 0 | 9 | 7 | 64.5 |
image | 2310 | -- | 0 | 19 | 7 | 85.7 |
letter | 16,000 | 4000 | 0 | 16 | 26 | 95.9 |
lr spectrometer | 531 | -- | 1 | 101 | 48 | 89.6 |
optical | 5620 | -- | 0 | 64 | 10 | 89.8 |
page-blocks | 5473 | -- | 0 | 10 | 5 | 10.2 |
sat | 4435 | 2000 | 0 | 36 | 6 | 76.2 |
shuttle | 43,500 | 14,500 | 0 | 9 | 7 | 21.4 |
solar flares (c) | 1389 | -- | 10 | 0 | 8 | 15.7 |
solar flares (m) | 1389 | -- | 10 | 0 | 6 | 4.9 |
soybean | 683 | -- | 35 | 0 | 19 | 94.1 |
thyroid (hyper) | 3772 | -- | 21 | 6 | 5 | 2.7 |
thyroid (hypo) | 3772 | -- | 21 | 6 | 5 | 7.7 |
thyroid (repl.) | 3772 | -- | 21 | 6 | 4 | 3.3 |
vehicle | 846 | -- | 0 | 18 | 4 | 74.2 |
vowel | 528 | 462 | 0 | 10 | 11 | 90.9 |
yeast | 1484 | -- | 0 | 8 | 10 | 68.8 |