Conclusions
We have presented the mixture of trees (MT), a probabilistic
model in which joint probability distributions are represented as
finite mixtures of tree distributions. Tree distributions have a number
of virtues--representational, computational and statistical--but have
limited expressive power. Bayesian and Markov networks achieve
significantly greater expressive power while retaining many of
the representational virtues of trees, but incur significantly higher
costs on the computational and statistical fronts. The mixture approach
provides an alternative upgrade path. While Bayesian and Markov
networks have no distinguished relationships between edges, and statistical
model selection procedures for these networks generally involve
additions and deletions of single edges, the MT model groups
overlapping sets of edges into mixture components, and edges are
added and removed via a maximum likelihood algorithm that is
constrained to fit tree models in each mixture component. We have
also seen that it is straightforward to develop Bayesian methods that
allow finer control over the choice of edges and smooth the numerical
parameterization of each of the component models.
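Concretely, writing \lambda_k for the mixing weights and T^k for the component tree distributions, an MT model with m components represents the joint distribution over the variable set x as

    Q(x) = \sum_{k=1}^{m} \lambda_k \, T^k(x), \qquad \lambda_k \ge 0, \quad \sum_{k=1}^{m} \lambda_k = 1,

where each T^k factors according to its own tree over the variables.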
Chow and Liu [1968] presented the basic maximum
likelihood algorithm for fitting tree distributions that provides
the M step of our EM algorithm, and also showed how to use ensembles
of trees to solve classification problems, where each tree
models the class-conditional density of one of the classes.
This approach was pursued by Friedman, Geiger, and Goldszmidt [1997]
and by Friedman, Goldszmidt, and Lee [1998], who
emphasized the connection with the naive Bayes model, and
presented empirical results that demonstrated the performance gains
that could be obtained by enhancing naive Bayes to allow connectivity
between the attributes. Our work is a further contribution to
this general line of research--we treat an ensemble of trees as
a mixture distribution. The mixture approach provides additional
flexibility in the classification domain, where the "choice variable"
need not be the class label, and also allows the architecture to be
applied to unsupervised learning problems.
The algorithms for learning and inference that we have presented
have relatively benign scaling: inference is linear in the
dimensionality and each step of the EM learning algorithm is quadratic
in the dimensionality. Such favorable time complexity is an important virtue of our
tree-based approach. In particularly large problems, however,
such as those that arise in information retrieval applications,
quadratic complexity can become onerous. To extend the use of
the MT model to such cases, we have developed the ACCL algorithm,
which exploits data sparseness and careful attention to data
structure issues to significantly reduce the run time; we presented
examples in which the speed-up obtained from the ACCL algorithm
was three orders of magnitude.
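The following Python sketch illustrates the quadratic cost of the Chow-Liu computation that forms the M step: it accumulates pairwise statistics under soft example weights (e.g. the posterior responsibilities of one mixture component) and then extracts a maximum-weight spanning tree. It is a minimal illustration assuming binary data; the function names, the weighting scheme, and the use of Kruskal's algorithm are presentational choices rather than a description of the implementation used in the experiments.

    import numpy as np
    from itertools import combinations

    def chow_liu_step(X, gamma, eps=1e-12):
        """Weighted Chow-Liu step for binary data X (N x n) with soft example
        weights gamma (N,). Returns the edges of a spanning tree maximizing the
        total pairwise mutual information. The O(n^2) pairwise statistics
        dominate the cost of each EM step."""
        N, n = X.shape
        w = gamma / (gamma.sum() + eps)       # normalized example weights
        p1 = X.T @ w                          # weighted P(x_u = 1) for each u

        def mutual_info(u, v):
            # Weighted joint distribution of the binary pair (x_u, x_v).
            p11 = np.dot(w, X[:, u] * X[:, v])
            p10, p01 = p1[u] - p11, p1[v] - p11
            p00 = 1.0 - p11 - p10 - p01
            joint = np.array([p00, p01, p10, p11]) + eps
            mu = np.array([1 - p1[u], 1 - p1[u], p1[u], p1[u]]) + eps
            mv = np.array([1 - p1[v], p1[v], 1 - p1[v], p1[v]]) + eps
            return float(np.sum(joint * np.log(joint / (mu * mv))))

        # Kruskal's algorithm: scan edges by decreasing mutual information and
        # keep any edge that joins two distinct components (union-find below).
        edges = sorted(combinations(range(n), 2),
                       key=lambda e: mutual_info(*e), reverse=True)
        parent = list(range(n))

        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a

        tree = []
        for u, v in edges:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                tree.append((u, v))
        return tree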
Are there other classes of graphical models whose structure can
be learned efficiently from data? Consider the class of Bayesian
networks for which the topological ordering of the variables is
fixed and the number of parents of each node is bounded by a fixed
constant. For this class the optimal model structure for a
given target distribution can be found in polynomial time
by a greedy algorithm. These models share with trees the property
of being matroids [West 1996]. The matroid is the unique
algebraic structure for which the "maximum weight" problem, in
particular the maximum weight spanning tree problem, is solved optimally
by a greedy algorithm. Graphical models that are matroids have
efficient structure learning algorithms; it is an interesting
open problem to find additional examples of such models.
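As an illustration of the greedy construction for this class, the following Python sketch scans the candidate edges consistent with a fixed ordering in decreasing order of an edge score and accepts each edge whose child node still has fewer than k parents; the in-degree bound is a partition matroid constraint, the setting in which such a greedy scan returns a maximum-weight independent set. Treating the structure score as a sum of edge weights, and the names used below, are illustrative assumptions rather than the formulation discussed above.

    import numpy as np

    def greedy_fixed_order_structure(scores, order, k):
        """Greedy structure selection for Bayesian networks with a fixed
        topological ordering and at most k parents per node.

        scores is an n x n array whose (u, v) entry is the weight of the
        candidate edge u -> v (e.g. a pairwise mutual information); order is a
        list of variable indices in topological order. Returns a dict mapping
        each node to its chosen parent list."""
        position = {v: i for i, v in enumerate(order)}
        # Candidate edges must respect the ordering: u may only point at a later v.
        candidates = [(u, v) for u in order for v in order
                      if position[u] < position[v]]
        # Greedy over the partition matroid defined by the in-degree bound:
        # scan edges by decreasing weight, keep each edge whose head has room.
        candidates.sort(key=lambda e: scores[e[0], e[1]], reverse=True)
        parents = {v: [] for v in order}
        for u, v in candidates:
            if len(parents[v]) < k:
                parents[v].append(u)
        return parents

With k = 1 the procedure simply assigns each node its single best-scoring predecessor as its only parent.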
We would like to acknowledge support for this project
from the National Science Foundation (NSF grant IIS-9988642)
and the Multidisciplinary Research Program of the Department
of Defense (MURI N00014-00-1-0637).