The Learning-Curve Sampling Method Applied to Model-Based Clustering
Christopher Meek, Bo Thiesson, David Heckerman;
2(Feb):397-418, 2002.
Abstract
We examine the learning-curve sampling method, an approach for
applying machine-learning algorithms to large data sets. The
approach is based on the observation that the computational cost
of learning a model increases as a function of the sample size of
the training data, whereas the accuracy of a model has
diminishing improvements as a function of sample size. Thus, the
learning-curve sampling method monitors the increasing costs and
performance as larger and larger amounts of data are used for
training, and terminates learning when future costs outweigh
future benefits. In this paper, we formalize the learning-curve
sampling method and its associated cost-benefit tradeoff in terms
of decision theory. In addition, we describe the application of
the learning-curve sampling method to the task of model-based
clustering via the expectation-maximization (EM) algorithm. In
experiments on three real data sets, we show that the
learning-curve sampling method produces models that are nearly as
accurate as those trained on complete data sets, but with
dramatically reduced learning times. Finally, we describe an
extension of the basic learning-curve approach for model-based
clustering that results in an additional speedup. This extension is
based on the observation that the shape of the learning curve for
a given model and data set is roughly independent of the number
of EM iterations used during training. Thus, we run EM for only a
few iterations to decide how many cases to use for training, and
then run EM to full convergence once the number of cases is
selected.
[abs]
[pdf]
[ps.gz]
[ps]