Classification of Imbalanced Marketing
Data with Balanced Random Sets
Vladimir Nikulin and Geoffrey J. McLachlan ;
JMLR W & CP 7:89-100, 2009.
Abstract
With imbalanced data a classifier built using
all of the data has the tendency to ignore the minority class.
To overcome this problem, we propose to use an ensemble
classifier constructed on the basis of a large number of
relatively small and balanced subsets, where representatives
from both patterns are to be selected randomly. As an outcome,
the system produces the matrix of linear regression
coefficients whose rows represent the random sub- sets and the
columns represent the features. Based on this matrix, we make
an assessment of how stable the influence of a particular
feature is. It is proposed to keep in the model only features
with stable influence. The final model represents an average of
the base-learners, which is not necessarily a linear
regression. Proper data pre-processing is very important for
the effectiveness of the whole system, and it is proposed to
reduce the original data to the most simple binary sparse
format, which is particularly convenient for the construction
of decision trees. As a result, any particular feature will be
represented by several binary variables or bins, which are
absolutely equivalent in terms of data structure. This property
is very important and may be used for feature selection. The
proposed method exploits not only contributions of particular
variables to the base-learners, but also the diversity of such
contributions. Test results against KDD-2009 competition
datasets are presented.