The accelerated tree learning algorithm
We have argued that the mixture-of-trees approach has significant
advantages over general Bayesian networks in terms of its algorithmic
complexity. In particular, the M step of the EM algorithm for
mixtures of trees is the ChowLiu algorithm, which scales
quadratically in the number of variables $n$ and linearly in the
size of the dataset $N$. Given that the E step is linear in $n$
and $N$ for mixtures of trees, we have a situation in which each
pass of the EM algorithm is quadratic in $n$ and linear in $N$.
Although this time complexity recommends the MT approach for
large-scale problems, the quadratic scaling in $n$ becomes problematic
for particularly large problems. In this section we propose a method
for reducing the time complexity of the MT learning algorithm and
demonstrate empirically the large performance gains that it yields.
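To make the per-iteration cost concrete, the following tally is a
schematic sketch under the complexity claims above; the number of
mixture components, denoted $m$ here, is notation we introduce for
this tally only:

\begin{align*}
\text{E step:}\quad & O(mnN) \quad\text{(each of the $m$ trees evaluated on each of the $N$ points, at cost $O(n)$ per evaluation)}\\
\text{M step:}\quad & O(mn^2N) \quad\text{(pairwise sufficient statistics for ChowLiu, computed once per component)}
\end{align*}

so that one EM pass costs $O(mn^2N)$ overall: quadratic in $n$ and
linear in $N$, as claimed.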
As a concrete example of the kind of problem we have in mind,
consider the clustering or classification of documents in
information retrieval. Here the variables are words from a vocabulary,
and the data points are documents. A document is represented as a binary
vector with a component equal to 1 for each word that is present
in the document and equal to 0 for each word that is not.
In a typical application the number of documents $N$ and the
vocabulary size $n$ both run into the thousands or tens of thousands.
Given such numbers, fitting a single tree to the data requires
on the order of $n^2 N$ counting operations.
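To see where the $n^2 N$ count comes from, here is a minimal sketch in
Python (ours, not the authors' implementation) of the naive accumulation
of pairwise co-occurrence counts, the sufficient statistics that the
ChowLiu algorithm consumes:

import numpy as np

def pairwise_counts_dense(docs: np.ndarray) -> np.ndarray:
    """Naive sufficient statistics for ChowLiu on binary data.

    docs is an (N, n) 0/1 matrix; entry (u, v) of the result counts
    the documents that contain both word u and word v. The triple
    loop makes the n^2 * N cost explicit.
    """
    N, n = docs.shape
    counts = np.zeros((n, n), dtype=np.int64)
    for i in range(N):              # N documents ...
        for u in range(n):          # ... times n^2 word pairs each
            for v in range(n):
                counts[u, v] += docs[i, u] & docs[i, v]
    return counts

In vectorized form this is simply docs.T @ docs, but vectorization
changes only the constant, not the $n^2 N$ asymptotics that the
present section attacks.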
Note, however, that this domain is characterized by a certain
sparseness: in particular, each document contains only a
relatively small number of words and thus most of the components
of its binary vector are 0.
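The natural data structure for this regime, sketched below under our
own (hypothetical) naming, stores each document as the list of indices
of the words it contains rather than as a length-$n$ binary vector:

def to_index_lists(docs):
    """Convert (N, n) 0/1 rows to lists of present-word indices.

    If a document contains only s << n words, its list has s entries,
    so the corpus shrinks from N * n stored components to roughly
    N * s integers.
    """
    return [[v for v, bit in enumerate(row) if bit] for row in docs]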
In this section, we show how to take advantage of data sparseness to
accelerate the ChowLiu algorithm. We show that in the sparse
regime we can often rank-order mutual information values without
actually computing these values. We also show how to speed up
the computation of the sufficient statistics by exploiting
sparseness. Combining these two ideas yields the
ACCL (accelerated Chow and Liu) algorithm, which provides
significant performance gains in both running time and memory.
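As a flavor of the second idea, here is a sketch (again ours; the
actual ACCL algorithm is developed in the subsections that follow) of
how the index-list representation above speeds up the sufficient
statistics: each document updates counts only for the pairs of words
it actually contains.

from collections import Counter
from itertools import combinations

def pairwise_counts_sparse(index_lists):
    """Co-occurrence and single-word counts from index lists.

    A document with s words costs O(s^2) instead of O(n^2), so the
    total cost is sum_i s_i^2 rather than n^2 * N. No n x n table of
    mostly-zero counts is ever materialized, which is also where the
    memory savings come from.
    """
    single_counts = Counter()
    pair_counts = Counter()
    for words in index_lists:
        single_counts.update(words)
        pair_counts.update(combinations(sorted(words), 2))
    return single_counts, pair_counts

For binary variables these counts suffice: together with $N$, the full
$2 \times 2$ contingency table of any pair $(u,v)$, and hence its
mutual information, follows by subtraction. Pairs that never co-occur
are absent from pair_counts altogether; these are exactly the pairs
whose mutual information values the rank-ordering idea avoids
computing one by one.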