Tree distributions
In this section we introduce the tree model and the notation that
will be used throughout the paper. Let $V$ denote a set of $n$
discrete random variables of interest. For each random variable
$v \in V$ let $\Omega(v)$ represent its range, $x_v \in \Omega(v)$ a
particular value, and $r_v$ the (finite) cardinality of $\Omega(v)$.
For each subset $A$ of $V$, let $\Omega(A) = \bigotimes_{v \in A} \Omega(v)$
and let $x_A$ denote an assignment to the variables in $A$.
To simplify notation, $x_V$ will be denoted by $x$ and
$\Omega(V)$ will be denoted simply by $\Omega$. Sometimes we need to refer to the
maximum of $r_v$ over $V$; we denote this value by $r_{\max}$.
We begin with undirected (Markov random field) representations of tree
distributions. Identifying the vertex set of a graph with the set of
random variables $V$, consider a graph $G = (V, E)$, where $E$ is a
set of undirected edges. We allow a tree to have multiple
connected components (thus our ``trees'' are generally called
forests). Given this definition, the number of edges $|E|$ and
the number of connected components $p$ are related as follows:
$$|E| + p = n,$$
implying that adding an edge to a tree reduces the number of
connected components by 1. Thus, a tree can have at most $n - 1$
edges. In this latter case we refer to the tree as a spanning
tree.
We parameterize a tree in the following way. For $u, v \in V$ and
$(u, v) \in E$, let $T_{uv}$ denote a joint probability distribution on
$x_u$ and $x_v$. We require these distributions to be consistent with
respect to marginalization, denoting by $T_v(x_v)$ the marginal of
$T_{uv}(x_u, x_v)$ or $T_{vu}(x_v, x_u)$ with respect to $x_u$ for any
$u$ such that $(u, v) \in E$. We now assign a distribution $T$ to the graph
$(V, E)$ as follows:
$$T(x) \;=\; \frac{\prod_{(u,v) \in E} T_{uv}(x_u, x_v)}{\prod_{v \in V} T_v(x_v)^{\deg v - 1}} \qquad (1)$$
where $\deg v$ is the degree of vertex $v$; i.e., the number of
edges incident to $v$. It can be verified that $T$ is in
fact a probability distribution; moreover, the pairwise probabilities
$T_{uv}$ are the marginals of $T$.
A tree distribution $T$ is defined to be any distribution
that admits a factorization of the form (1).
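As a concrete illustration of (1), the following sketch evaluates $T(x)$ on a
small binary chain. This is a minimal sketch, not code from the paper: the edge
tables and the names (edge_marginals, tree_prob) are hypothetical, and the
tables were chosen to be consistent under marginalization.
\begin{verbatim}
import numpy as np

# Hypothetical 3-vertex chain 0 - 1 - 2 over binary variables.
# edge_marginals[(u, v)] is the joint table T_uv, indexed [x_u, x_v];
# the two tables agree on the marginal of vertex 1 ([0.5, 0.5]).
edge_marginals = {
    (0, 1): np.array([[0.3, 0.1], [0.2, 0.4]]),
    (1, 2): np.array([[0.25, 0.25], [0.3, 0.2]]),
}

def node_marginal(v, edges):
    """Marginal T_v, read off any edge table containing v."""
    for (a, b), tab in edges.items():
        if a == v:
            return tab.sum(axis=1)
        if b == v:
            return tab.sum(axis=0)
    raise ValueError("isolated vertex: its marginal must be supplied")

def tree_prob(x, edges, vertices):
    """Evaluate T(x) via equation (1)."""
    deg = {v: 0 for v in vertices}
    num = 1.0
    for (u, v), tab in edges.items():   # numerator: edge joints
        num *= tab[x[u], x[v]]
        deg[u] += 1
        deg[v] += 1
    den = 1.0
    for v in vertices:                  # denominator: T_v^(deg v - 1)
        if deg[v] > 1:
            den *= node_marginal(v, edges)[x[v]] ** (deg[v] - 1)
    return num / den

# Sanity check: T sums to 1 over all 8 assignments.
print(sum(tree_prob((a, b, c), edge_marginals, [0, 1, 2])
          for a in (0, 1) for b in (0, 1) for c in (0, 1)))
\end{verbatim}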
Tree distributions can also be represented using directed (Bayesian
network) graphical models. Let $G_D = (V, E_D)$ denote a directed
tree (possibly a forest), where $E_D$ is a set of directed edges and
where each node $v$ has (at most) one parent, denoted $\mathrm{pa}(v)$.
We parameterize this graph as follows:
$$T(x) \;=\; \prod_{v \in V} T_{v \mid \mathrm{pa}(v)}(x_v \mid x_{\mathrm{pa}(v)}) \qquad (2)$$
where $T_{v \mid \mathrm{pa}(v)}(\cdot \mid x_{\mathrm{pa}(v)})$ is an arbitrary
conditional distribution (taken to be the marginal $T_v$ when $v$ has
no parent). It can be verified that $T$ indeed
defines a probability distribution; moreover, the conditionals of $T$
between each vertex and its parent are given by the local conditionals
$T_{v \mid \mathrm{pa}(v)}$.
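The directed factorization (2) is even simpler to evaluate. The sketch below is
again hypothetical: parent maps each vertex to $\mathrm{pa}(v)$ (None for a
root), and cond[v] holds $T_{v \mid \mathrm{pa}(v)}$, or the marginal $T_v$ for
a root. The numbers are exactly the conditionals implied by the edge tables of
the previous sketch.
\begin{verbatim}
import numpy as np

# Hypothetical directed chain 0 -> 1 -> 2 over binary variables.
parent = {0: None, 1: 0, 2: 1}
cond = {
    0: np.array([0.4, 0.6]),                  # root marginal T_0
    1: np.array([[0.75, 0.25], [1/3, 2/3]]),  # T_{1|0}, rows indexed by x_0
    2: np.array([[0.5, 0.5], [0.6, 0.4]]),    # T_{2|1}, rows indexed by x_1
}

def directed_tree_prob(x):
    """Evaluate T(x) as the product of conditionals in equation (2)."""
    p = 1.0
    for v, pa in parent.items():
        p *= cond[v][x[v]] if pa is None else cond[v][x[pa], x[v]]
    return p

# Sanity check: T sums to 1 over all 8 assignments.
print(sum(directed_tree_prob((a, b, c))
          for a in (0, 1) for b in (0, 1) for c in (0, 1)))
\end{verbatim}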
We shall call the representations (1) and
(2) the undirected and directed tree
representations of the distribution $T$, respectively. We can
readily convert between these representations; for example, to
convert (1) to a directed representation we choose
an arbitrary root in each connected component and direct each
edge away from the root. For $(u, v) \in E$ with $u$ closer to
the root than $v$, let $\mathrm{pa}(v) = u$. Now compute the conditional
probabilities corresponding to each directed edge by setting
$T_{v \mid \mathrm{pa}(v)} = T_{v\,\mathrm{pa}(v)} / T_{\mathrm{pa}(v)}$,
proceeding recursively from the root.
Figure 2 illustrates this process on a tree with
5 vertices.
Figure 2: A tree in its undirected (a) and directed (b) representations.
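The conversion from (1) to (2) is mechanical, and a short sketch may make it
concrete. Assuming the hypothetical edge tables from the first sketch, the code
below roots each connected component, directs edges away from the root by a
traversal, and sets
$T_{v \mid \mathrm{pa}(v)} = T_{v\,\mathrm{pa}(v)} / T_{\mathrm{pa}(v)}$;
all names are illustrative.
\begin{verbatim}
import numpy as np

edge_marginals = {
    (0, 1): np.array([[0.3, 0.1], [0.2, 0.4]]),
    (1, 2): np.array([[0.25, 0.25], [0.3, 0.2]]),
}
vertices = [0, 1, 2]

# Adjacency lists; each neighbor carries the joint oriented [x_u, x_v].
adj = {v: [] for v in vertices}
for (u, v), tab in edge_marginals.items():
    adj[u].append((v, tab))
    adj[v].append((u, tab.T))

parent, cond, visited = {}, {}, set()
for root in vertices:
    if root in visited:
        continue
    parent[root] = None
    if adj[root]:   # root's entry is its marginal T_root
        cond[root] = adj[root][0][1].sum(axis=1)
    visited.add(root)
    stack = [root]
    while stack:    # direct every edge away from the root
        u = stack.pop()
        for v, tab_uv in adj[u]:
            if v not in visited:
                parent[v] = u
                T_u = tab_uv.sum(axis=1)
                cond[v] = tab_uv / T_u[:, None]   # T_{v|u}(x_v | x_u)
                visited.add(v)
                stack.append(v)

print(parent)   # {0: None, 1: 0, 2: 1}
print(cond[1])  # recovers T_{1|0} from the directed sketch above
\end{verbatim}
On the 5-vertex tree of Figure 2, the same traversal would produce a directed
representation like the one shown in panel (b).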
The directed tree representation has the advantage of having
independent parameters. The total number of free parameters in either
representation is:
$$\sum_{(u,v) \in E} r_u r_v \;-\; \sum_{v \in V} (\deg v - 1)\, r_v \;-\; p. \qquad (3)$$
Rewriting (3) as
$\sum_{v \in V} (r_v - 1) + \sum_{(u,v) \in E} (r_u - 1)(r_v - 1)$
shows that each edge $(u, v)$ increases the number of parameters by
$(r_u - 1)(r_v - 1)$.
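A quick numeric check of (3) may be useful. The forest below is a made-up
example with two connected components ($p = 2$); the second count applies the
per-edge increment $(r_u - 1)(r_v - 1)$ directly and agrees with (3).
\begin{verbatim}
# Hypothetical forest: chain 0 - 1 - 2 plus edge 3 - 4, so p = 2.
r = {0: 2, 1: 3, 2: 2, 3: 4, 4: 2}     # cardinalities r_v
edges = [(0, 1), (1, 2), (3, 4)]
p = 2

deg = {v: 0 for v in r}
for u, v in edges:
    deg[u] += 1
    deg[v] += 1

# Equation (3).
n_params = (sum(r[u] * r[v] for u, v in edges)
            - sum((deg[v] - 1) * r[v] for v in r) - p)

# Direct count: (r_v - 1) per vertex marginal, plus
# (r_u - 1)(r_v - 1) extra parameters per edge.
direct = (sum(r[v] - 1 for v in r)
          + sum((r[u] - 1) * (r[v] - 1) for u, v in edges))

print(n_params, direct)   # 15 15
\end{verbatim}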
The set of conditional independencies associated with a tree
distribution is readily characterized [Lauritzen, Dawid, Larsen, Leimer
1990].
In particular, two subsets $A, B \subset V$ are independent given
$C \subset V$ if $C$ intersects every path (ignoring the direction
of edges in the directed case) between $u$ and $v$, for all $u \in A$
and $v \in B$.
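Since paths in a tree are unique, this criterion reduces to a reachability
test: $A$ and $B$ are separated by $C$ exactly when no vertex of $B$ is
reachable from $A$ after deleting $C$. The sketch below is one hypothetical way
to check this; the function name and the example tree are illustrative.
\begin{verbatim}
from collections import deque

def separated(adj, A, B, C):
    """True iff C intersects every path between A and B in the tree."""
    blocked = set(C)
    reach = set(u for u in A if u not in blocked)
    queue = deque(reach)
    while queue:                      # BFS that never enters C
        u = queue.popleft()
        for v in adj[u]:
            if v not in blocked and v not in reach:
                reach.add(v)
                queue.append(v)
    return not (reach & set(B))

# Hypothetical 5-vertex tree: edges 0-1, 1-2, 1-3, 3-4.
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3]}
print(separated(adj, {0}, {2, 4}, {1}))  # True: vertex 1 blocks both paths
print(separated(adj, {0}, {4}, {2}))     # False: path 0-1-3-4 avoids 2
\end{verbatim}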