Active Batch Learning with Stochastic Query-by-Forest (SQBF)
A. Borisov, E. Tuv & G.
Runger; JMLR W&CP 16:59–69, 2011.
Abstract
In a conventional machine learning approach, one uses labeled data to train the
model. However, often we have a data set with few labeled instances, and a large number of
unlabeled ones. This is called a semi-supervised learning problem. It is well known that often
unlabeled data could be used to improve a model. In real world scenarios, labeled data can
usually be obtained dynamically. However, obtaining new labels in most cases requires human
effort and/or is costly. An active learning (AL) paradigm tries to direct the queries in such way
that a good model can be trained with a relatively small number of queries. In this work we focus
on so-called pool-based active learning, i.e., learning when there is a fixed large pool of
unlabeled data, and we can query the label for any instance from this pool at some cost.
Existing methods are often based on strong assumptions for the joint input/output
distribution (i.e., a mixture of Gaussians, linearly separable input space, etc.), or use a
distance-based approach (such as Euclidean or Mahalanobis distances). That makes such
methods very susceptible to noise in input space, and they often work poorly in high
dimensions. Also, such methods assume numeric inputs only. In addition, for most
methods relying on distance computations and/or linear models, computational complexity
scales at least quadratically with respect to the number of unlabeled samples, rendering
them useless on large datasets. In real world applications data is often large, noisy,
contains irrelevant inputs, missing values, and mixed variable types. Often queries
should be arranged in groups or batches (this is called batch AL). In batch querying one
should consider both the ’usefulness’ of individual queries within a batch, and the
batch diversity. Batch AL, although being very practical by nature, is rarely addressed
by existing AL approaches. Here we propose a new non-parametric approach to the
AL problem called Stochastic Query by Forest (SQRF), that effectively addresses the
challenges described above. Our algorithm is based on a QBC algorithm applied to
an RF ensemble, and our main contribution is the batch diversification strategy. We
describe two different strategies for batch selection, the first of which achieved the
highest average score on the AISTATS 2010 active learning challenge and ranked top on
one of the challenge datasets. Our work focuses on binary classification problems, but
our method can be directly applied to regression or multi-class problems with minor
modifications.
Page last modified on Wed Mar 30 11:09:21 2011.