Although real-world problems often have multiple classes, many learning algorithms are inherently binary, i.e., they are only able to discriminate between two classes. The reasons for this may be constraints imposed by the hypothesis language (e.g., linear discriminants or support vector machines), the learning architecture (e.g., neural networks with single output nodes), or the learning framework (e.g., many rule learning algorithms are tailored towards concept learning, i.e., the problem of learning a concept description from positive and negative examples). There are two principal approaches for applying such algorithms to multi-class problems: one is to generalize the algorithm itself, as has been done, e.g., for support vector machines (Weston and Watkins, 1999; Mayoraz and Alpaydin, 1999) and boosting (Freund and Schapire, 1997); the other is to employ class binarization techniques (Section 2), which reduce the multi-class problem to a series of binary problems.
One of the most common class binarization approaches is to learn a separate concept description for each individual class, i.e., to form a series of problems, one for each class, where all examples of that class are regarded as positive examples, while all other examples are regarded as negative. Pairwise classification is an alternative class binarization technique, which has lately received some attention in the neural network and support vector machine communities (Section 8). Its basic idea is to reduce a multi-class problem to multiple two-class problems by learning one classifier for each pair of classes, using only the training examples of these two classes and ignoring all others (Section 3).
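As a concrete illustration of the pairwise scheme, the following minimal sketch trains one binary model per pair of classes and combines their predictions by simple voting. It assumes a generic binary base learner (scikit-learn's LogisticRegression serves here only as a stand-in for the rule learners studied in this paper), and the function names are ours, not the paper's.

    from collections import Counter
    from itertools import combinations

    import numpy as np
    from sklearn.linear_model import LogisticRegression  # stand-in binary learner

    def round_robin_fit(X, y):
        # One binary problem per unordered pair of classes: c(c-1)/2 models,
        # each trained only on the examples of its two classes.
        models = {}
        for a, b in combinations(np.unique(y), 2):
            mask = (y == a) | (y == b)
            models[(a, b)] = LogisticRegression().fit(X[mask], y[mask])
        return models

    def round_robin_predict(models, X):
        # Each pairwise model votes for one of its two classes;
        # the class with the most votes wins.
        votes = np.array([m.predict(X) for m in models.values()])
        return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])

For practical use, scikit-learn ships both decompositions ready-made, as OneVsOneClassifier and OneVsRestClassifier in sklearn.multiclass.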
In this paper, we will, on the one hand, evaluate the technique, which we name round robin binarization, for inductive rule learning algorithms and show that it is significantly more accurate than the ordered and unordered (one-against-all) techniques that are currently used in rule learning algorithms like RIPPER (Section 4). On the other hand, we will provide a detailed analysis of the complexity of the approach and show that the c(c-1)/2 learning problems of a single round robin can be learned more efficiently than the c learning problems of the conventional one-against-all technique (Section 5). This analysis is independent of the base learning algorithm, but we show that the advantage increases with more expensive learners. The theoretical results are confirmed by our experimental evaluation. In Section 6, we will investigate the properties of round robin classification as a general ensemble technique, with mixed results: while the technique is able to reduce C5.0's error on average, the gains do not approach those of boosting. However, a straightforward combination of round robin learning with bagging does lead to additional gains for RIPPER. Finally, we will discuss a few other relevant properties of round robin classification, including its suitability for parallel implementations, possible remedies for its inefficiency at classification time, and its comprehensibility (Section 7).
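To preview the counting argument behind the efficiency claim of Section 5 (a sketch in our notation, under the simplifying view that cost is measured by the number of training examples processed): let n be the total number of training examples and n_i the number of examples of class i, so that \sum_i n_i = n. Each example participates only in the c-1 pairwise problems that involve its own class, hence

    \sum_{i<j} (n_i + n_j) = (c-1) \sum_{i=1}^{c} n_i = (c-1)\,n < c \cdot n,

whereas one-against-all trains each of its c classifiers on all n examples. Round robin thus processes fewer examples in total, and since each individual pairwise problem is much smaller than n, the advantage grows further for base learners whose runtime is superlinear in the training set size.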