Mining Recurring Concept Drifts with Limited Labeled Streaming Data
Peipei Li (Hefei University of Technology), Xindong Wu (University
of Vermont), and Xuegang Hu (Hefei University of Technology);
JMLR W&P 13:241-252, 2010.
Abstract
Tracking recurring concept drifts is a significant issue for machine
learning and data mining that frequently appears in real-world stream
classification problems. Learning recurring concepts in a data stream
environment with unlabeled data is a challenge for many streaming
classification algorithms, and this challenge has received little
attention from the research community. Motivated by this challenge,
this paper focuses on the problem of recurring contexts in streaming
environments with limited labeled data. We propose a Semi-supervised
classification algorithm for data streams with REcurring
concept Drifts and Limited LAbeled data, called REDLLA, in which
a decision tree is adopted as the classification model. While a tree
is grown, a clustering algorithm based on k-Means is applied to produce
concept clusters, and unlabeled data are labeled at the leaves. Based on
the deviations between historical and new concept clusters, potential
concept drifts are distinguished and recurring concepts are maintained.
Extensive studies on both synthetic and real-world data confirm the
advantages of our REDLLA algorithm over two state-of-the-art online
classification algorithms, CVFDT and CDRDT, and over several known
online semi-supervised algorithms, even when more than 90% of the
data are unlabeled.
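
The two mechanisms named in the abstract, labeling unlabeled data through k-Means concept clusters and comparing historical with new clusters, can be illustrated with a minimal sketch. This is not the REDLLA implementation: the abstract does not give the labeling rule or the deviation measure, so the majority-vote labeling, the nearest-center deviation, and all function names below are assumptions made for illustration only.

```python
from collections import Counter
from math import dist  # Python 3.8+


def k_means(points, k, iters=20):
    """Plain k-means; centers start at the first k points (an assumption
    chosen here only to keep the sketch deterministic)."""
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Recompute each center as the mean of its group; empty groups stay put.
        new_centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda i: dist(p, centers[i]))


def label_by_clusters(labeled, unlabeled, k):
    """Hypothetical leaf-level labeling: cluster all points, then give each
    unlabeled point the majority label of the labeled points in its cluster.
    labeled is a list of (point, label); unlabeled is a list of points."""
    points = [p for p, _ in labeled] + list(unlabeled)
    centers = k_means(points, k)
    votes = [Counter() for _ in centers]
    for p, y in labeled:
        votes[nearest(p, centers)][y] += 1
    # Clusters without any labeled point fall back to the overall majority.
    fallback = Counter(y for _, y in labeled).most_common(1)[0][0]
    cluster_label = [v.most_common(1)[0][0] if v else fallback for v in votes]
    predicted = [cluster_label[nearest(p, centers)] for p in unlabeled]
    return centers, predicted


def cluster_deviation(old_centers, new_centers):
    """Hypothetical deviation: mean distance from each new center to its
    nearest historical center. A large value hints at a potential drift."""
    total = sum(min(dist(c, o) for o in old_centers) for c in new_centers)
    return total / len(new_centers)


labeled = [((0.0, 0.0), "a"), ((5.0, 5.0), "b")]
unlabeled = [(0.2, 0.1), (4.8, 5.1), (0.1, 0.3)]
centers, predicted = label_by_clusters(labeled, unlabeled, k=2)
print(predicted)
```

In a streaming setting, one would compare `cluster_deviation` against a threshold (also an assumption here) to flag a potential drift, and keep historical centers so that a later small deviation from a stored cluster set can be treated as a recurring concept.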