Introduction
Probabilistic inference has become a core technology in AI,
largely due to developments in graph-theoretic methods for the
representation and manipulation of complex probability
distributions [Pearl 1988]. Whether in their guise as
directed graphs (Bayesian networks) or as undirected graphs (Markov
random fields), probabilistic graphical models have a number
of virtues as representations of uncertainty and as inference engines.
Graphical models allow a separation between qualitative, structural
aspects of uncertain knowledge and the quantitative, parametric aspects
of uncertainty--the former represented via patterns of edges in
the graph and the latter represented as numerical values associated
with subsets of nodes in the graph. Domain experts often find this
separation natural, and it tames some of the problems associated with
structuring, interpreting, and troubleshooting the model.
Even more importantly, the graph-theoretic framework has allowed
for the development of general inference algorithms, which in
many cases provide orders of magnitude speedups over brute-force
methods [Cowell, Dawid, Lauritzen, Spiegelhalter 1999; Shafer, Shenoy 1990].
These virtues have not gone unnoticed by researchers interested in
machine learning, and graphical models are being widely explored as the
underlying architectures in systems for classification, prediction and
density estimation [Bishop 1999; Friedman, Geiger, Goldszmidt 1997;
Heckerman, Geiger, Chickering 1995; Hinton, Dayan, Frey, Neal 1995;
Friedman, Getoor, Koller, Pfeffer 1996; Monti, Cooper 1998;
Saul, Jordan 1999].
Indeed, it is possible to view a wide variety of classical machine learning
architectures as instances of graphical models, and the graphical model
framework provides a natural design procedure for exploring architectural
variations on classical themes [Buntine 1996; Smyth, Heckerman, Jordan 1997].
As in many machine learning problems, the problem of learning a
graphical model from data can be divided into the problem of
parameter learning and the problem of structure
learning. Much progress has been made on the former problem,
much of it cast within the framework of the expectation-maximization
(EM) algorithm [Lauritzen 1995]. The EM algorithm essentially
runs a probabilistic inference algorithm as a subroutine to
compute the "expected sufficient statistics" for the data,
reducing the parameter learning problem to a decoupled set of local
statistical estimation problems at each node of the graph. This
link between probabilistic inference and parameter learning is an
important one, allowing developments in efficient inference to
have immediate impact on research on learning algorithms.
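To make this decoupling concrete, consider a discrete Bayesian network with
conditional probability tables theta_{ijk} = P(x_i = k | pa_i = j); this
notation is ours, introduced only for illustration, and does not appear later
in the paper. The standard EM update for such a network is

\[
\theta_{ijk}^{\mathrm{new}}
  = \frac{\sum_{m=1}^{N} P\bigl(x_i = k,\ \mathrm{pa}_i = j \mid \mathbf{x}^{(m)}, \theta^{\mathrm{old}}\bigr)}
         {\sum_{k'} \sum_{m=1}^{N} P\bigl(x_i = k',\ \mathrm{pa}_i = j \mid \mathbf{x}^{(m)}, \theta^{\mathrm{old}}\bigr)},
\]

where each posterior probability is computed by probabilistic inference on
the (possibly partially observed) data case x^{(m)}. The sums in the
numerator are exactly the expected sufficient statistics, and each table is
re-estimated locally at its node.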
The problem of learning the structure of a graph from data is
significantly harder. In practice, most structure learning is carried out
by heuristic local search: starting with a given graph and improving it by
adding or deleting one edge at a time [Heckerman, Geiger, Chickering 1995;
Cooper, Herskovits 1992].
There is an important special case in which both parameter learning
and structure learning are tractable, namely the case of graphical
models in the form of a tree distribution. As shown by
[Chow, Liu 1968], the tree distribution that maximizes the
likelihood of a set of observations on nodes--as well as the
parameters of the tree--can be found in time quadratic in the number
of variables in the domain. This algorithm is known as the Chow-Liu
algorithm.
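To illustrate, here is a minimal Python sketch of the Chow-Liu procedure for
fully observed discrete data; the function names, the naive pairwise
mutual-information loop, and the use of Prim's algorithm for the
maximum-weight spanning tree are our own illustrative choices, not the
authors' implementation.

import numpy as np


def mutual_information(xi, xj):
    """Empirical mutual information (in nats) between two integer-coded columns."""
    joint = np.zeros((xi.max() + 1, xj.max() + 1))
    np.add.at(joint, (xi, xj), 1.0)        # empirical joint counts
    joint /= joint.sum()                    # empirical joint distribution
    pi = joint.sum(axis=1, keepdims=True)   # marginal of xi
    pj = joint.sum(axis=0, keepdims=True)   # marginal of xj
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pi * pj)[nz])).sum())


def chow_liu_edges(data):
    """Edges of the maximum-likelihood tree over the columns of `data`
    (an N x n integer array): the maximum-weight spanning tree under
    pairwise mutual-information weights, built here with Prim's algorithm."""
    n = data.shape[1]
    mi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mi[i, j] = mi[j, i] = mutual_information(data[:, i], data[:, j])
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = max(((a, b) for a in in_tree for b in range(n) if b not in in_tree),
                   key=lambda e: mi[e])
        edges.append((i, j))
        in_tree.add(j)
    return edges                            # the n - 1 edges of the Chow-Liu tree

Given these edges, the parameters of the tree are simply the empirical
pairwise marginals on the edges (together with the single-node marginals),
so the cost of the fit is dominated by the pairwise statistics, which is the
quadratic dependence on the number of variables mentioned above.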
Trees also have the virtue that probabilistic inference is guaranteed
to be efficient, and indeed historically the earliest research in
AI on efficient inference focused on trees [Pearl 1988].
Later research extended this early work by first considering
general singly-connected graphs [Pearl 1988], and then
considering graphs with arbitrary (acyclic) patterns of
connectivity [Cowell, Dawid, Lauritzen, Spiegelhalter 1999]. This line of
research has provided one useful "upgrade path" from tree distributions to the
complex Bayesian and Markov networks currently being studied.
In this paper we consider an alternative upgrade path.
Inspired by the success of mixture models in providing
simple, effective generalizations of classical methods in many simpler
density estimation settings [McLachlan, Basford 1988], we consider
a generalization of tree distributions known as the
mixtures-of-trees (MT) model.
Figure 1: A mixture of trees over a domain of random variables, in which a
hidden choice variable selects the mixture component; conditional on the
value of the choice variable, the dependency structure is a tree. A detailed
presentation of the mixture-of-trees model is provided in Section 3.

As suggested in Figure 1, the MT model involves the probabilistic mixture of
a set of graphical components, each of which is a tree. In this paper we
describe
likelihood-based algorithms for learning the parameters and
structure of such models.
One can also consider probabilistic mixtures of more general graphical
models; indeed, the general case is the Bayesian multinet
introduced by [Geiger, Heckerman 1996].
The Bayesian multinet is a mixture model in which each mixture
component is an arbitrary graphical model. The advantage of
Bayesian multinets over more traditional graphical models is the
ability to represent context-specific independencies--situations
in which subsets of variables exhibit certain conditional independencies
for some, but not all, values of a conditioning variable.
(Further work on context-specific independence has been presented
by [Boutilier 1996]). By making context-specific independencies
explicit as multiple collections of edges, one can obtain (a) more parsimonious
representations of joint probabilities and (b) more efficient inference
algorithms.
In the machine learning setting, however, the advantages of the general
Bayesian multinet formalism are less apparent. Allowing each mixture
component to be a general graphical model forces us to face the difficulties
of learning general graphical structure. Moreover, greedy edge-addition
and edge-deletion algorithms seem particularly ill-suited to the
Bayesian multinet, given that it is the focus on collections of edges
rather than single edges that underlies much of the intuitive appeal
of this architecture.
We view the mixture of trees as providing a reasonable compromise
between the simplicity of tree distributions and the expressive power
of the Bayesian multinet, while doing so within a restricted setting that
leads to efficient machine learning algorithms. In particular, as we show
in this paper, there is a simple generalization of the Chow-Liu algorithm
that makes it possible to find (local) maxima of likelihoods (or penalized
likelihoods) efficiently in general MT models. This algorithm is an
iterative EM algorithm, in which the inner
loop (the M step) involves invoking the Chow-Liu algorithm to determine
the structure and parameters of the individual mixture components.
Thus, in a very concrete sense, this algorithm searches in the space
of collections of edges.
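In outline, and with notation chosen here only for illustration (mixture
weights lambda_k and tree distributions T_k over the observed variables),
the iteration has the standard mixture form. The E step computes the
posterior responsibility of component k for each data case x^{(m)},

\[
\gamma_k(m) \;=\; \frac{\lambda_k\, T_k\bigl(x^{(m)}\bigr)}{\sum_{k'} \lambda_{k'}\, T_{k'}\bigl(x^{(m)}\bigr)},
\]

and the M step sets

\[
\lambda_k^{\mathrm{new}} \;=\; \frac{1}{N} \sum_{m=1}^{N} \gamma_k(m)
\]

and refits each T_k by running the Chow-Liu algorithm with data case m
carrying weight gamma_k(m), which yields both the structure and the
parameters of that component.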
In summary, the MT model is a multiple network representation
that shares many of the basic features of Bayesian and Markov network
representations, but brings new features to the fore. We believe that
these features expand the scope of graph-theoretic probabilistic
representations in useful ways and may be particularly appropriate
for machine learning problems.