Sufficient Dimensionality Reduction
Amir Globerson, Naftali Tishby;
3(Mar):1307-1331, 2003.
Abstract
Dimensionality reduction of empirical co-occurrence data
is a fundamental problem in unsupervised learning. It is also a well
studied problem in statistics known as the analysis of
cross-classified data.
One principled approach to this problem is to represent the data in
low dimension with minimal loss of (mutual) information
contained in the original data. In this paper we introduce an
information theoretic nonlinear method for finding such a most informative
dimension reduction.
In contrast with previously introduced clustering based approaches,
here we extract
continuous feature functions directly from the
co-occurrence matrix. In a sense, we automatically extract functions
of the variables that serve as approximate sufficient statistics for
a sample of one variable about the other one.
Our method is different from dimensionality reduction methods which are based
on a specific, sometimes arbitrary, metric or embedding.
Another interpretation of our method is as
generalized - multi-dimensional - non-linear regression, where rather than
fitting one regression function through two dimensional data, we
extract
d-regression functions whose expectation values capture the
information among the variables. It thus presents a new learning
paradigm that unifies aspects from both supervised and unsupervised
learning. The resulting dimension reduction can be described by two
conjugate d-dimensional differential manifolds that are coupled through
Maximum Entropy
I-projections. The Riemannian metrics of these manifolds
are determined by the observed expectation values of our extracted features.
Following this geometric interpretation we present an iterative
information projection algorithm for finding such features
and prove its convergence. Our algorithm is similar to the method of
"association analysis" in statistics, though the feature extraction
context as well as the information theoretic and geometric
interpretation are new.
The algorithm is illustrated by
various synthetic co-occurrence data. It is then demonstrated for
text categorization and information retrieval and proves effective
in selecting a small set of features, often improving
performance over the original feature set.
[abs]
[pdf]
[ps.gz]
[ps]