Tree Induction vs. Logistic Regression: A Learning-Curve Analysis
Claudia Perlich, Foster Provost, Jeffrey S. Simonoff; 4(Jun):211-255, 2003.
Abstract
Tree induction and logistic regression are two standard,
off-the-shelf methods for building models for classification. We
present a large-scale experimental comparison of logistic regression
and tree induction, assessing classification accuracy and the quality
of rankings based on class-membership probabilities. We use a
learning-curve analysis to examine the relationship of these measures
to the size of the training set. The study yields four main
findings. (1) Contrary to some prior observations,
logistic regression does not generally outperform tree induction. (2)
More specifically, and not surprisingly, logistic regression is better
for smaller training sets and tree induction for larger data sets.
Importantly, this often holds for training sets drawn from the same
domain (that is, the learning curves cross), so conclusions about
induction-algorithm superiority on a given domain must be based on an
analysis of the learning curves. (3) Contrary to conventional wisdom,
tree induction is effective at producing probability-based rankings,
although, for a given training-set size, it appears to be relatively
less effective at ranking than at classification. Finally, (4) the domains on
which tree induction and logistic regression are ultimately preferable
can be characterized surprisingly well by a simple measure of
the separability of signal from noise.
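
The two measures and the notion of crossing learning curves can be made concrete with a minimal sketch. It assumes binary 0/1 labels and, since the abstract does not name a ranking metric, uses the area under the ROC curve (AUC) as one common choice for ranking quality. The helper names (`accuracy`, `auc`, `crossover_size`) and the numbers in the usage lines are hypothetical illustrations, not results from the study.

```python
from typing import Optional, Sequence


def accuracy(probs: Sequence[float], labels: Sequence[int],
             threshold: float = 0.5) -> float:
    """Fraction of examples whose thresholded probability matches the 0/1 label."""
    hits = sum(int(p >= threshold) == y for p, y in zip(probs, labels))
    return hits / len(labels)


def auc(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Chance that a random positive is scored above a random negative.

    Ties count as half a win. This is the pairwise (Mann-Whitney) form of
    AUC, an assumed stand-in for "quality of rankings".
    """
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def crossover_size(sizes: Sequence[int], curve_a: Sequence[float],
                   curve_b: Sequence[float]) -> Optional[int]:
    """First training size where curve_b overtakes curve_a.

    Returns None if curve_a never led or curve_b never moves ahead.
    """
    a_led = False
    for n, a, b in zip(sizes, curve_a, curve_b):
        if a >= b:
            a_led = True
        elif a_led:
            return n
    return None


# Illustrative values only (not taken from the paper).
probs = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
print(accuracy(probs, labels))  # 0.5
print(auc(probs, labels))       # 0.75

sizes = [100, 1000, 10000]
lr_curve = [0.80, 0.82, 0.83]    # hypothetical logistic regression accuracies
tree_curve = [0.70, 0.81, 0.88]  # hypothetical tree induction accuracies
print(crossover_size(sizes, lr_curve, tree_curve))  # 10000
```

In this toy case the same predictions score differently on the two measures, and the comparison between the two methods flips between the second and third training sizes, which is why a verdict drawn at a single size can mislead.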