The Coding Divergence for Measuring the Complexity of Separating Two Sets
Mahito Sugiyama (Kyoto University) and Akihiro Yamamoto (Kyoto University);
JMLR W&P 13:127-143, 2010.
Abstract
In this paper we integrate two essential processes, discretization of continuous
data and learning of a model that explains them, towards fully
computational machine learning from continuous data. Discretization
is fundamental for machine learning and data mining, since every continuous
datum; e.g., a real-valued datum obtained by observation in
the real world, must be discretized and converted from analog (continuous)
to digital (discrete) form to store in databases. However, most
machine learning methods do not pay attention to the situation; i.e.,
they use digital data in actual applications on a computer whereas
assume analog data (usually vectors of real numbers) theoretically.
To bridge the gap, we propose a novel measure of the difference between
two sets of data, called the coding divergence, and unify two
processes discretization and learning computationally. Discretization
of continuous data is realized by a topological mapping (in the sense
of mathematics) from the d-dimensional Euclidean space into the
Cantor space, and the simplest model is learned in the Cantor
space, which corresponds to the minimum open set separating the
given two sets of data. Furthermore, we construct a classifier using
the divergence, and experimentally demonstrate robust performance
of it. Our contribution is not only introducing a new measure from
the computational point of view, but also triggering more interaction
between experimental science and machine learning.