## Maximum Entropy Discrimination Markov Networks

** Jun Zhu, Eric P. Xing**; 10(88):2531−2569, 2009.

### Abstract

The standard maximum margin approach for structured prediction lacks
a straightforward probabilistic interpretation of the learning
scheme and the prediction rule. Therefore its unique advantages such
as dual sparseness and kernel tricks cannot be easily conjoined with
the merits of a probabilistic model such as Bayesian regularization,
model averaging, and ability to model hidden variables. In this
paper, we present a new general framework called *maximum
entropy discrimination Markov networks* (MaxEnDNet, or simply,
MEDN), which integrates these two approaches and combines and
extends their merits. Major innovations of this approach include: 1)
It extends the conventional max-entropy discrimination learning of
classification rules to a new *structural* max-entropy
discrimination paradigm of learning a distribution of Markov
networks. 2) It generalizes the extant Markov network
structured-prediction rule based on a point estimator of model
coefficients to an averaging model akin to a Bayesian predictor that
integrates over a learned posterior distribution of model
coefficients. 3) It admits flexible entropic regularization of the
model during learning. By plugging in different prior distributions
of the model coefficients, it subsumes the well-known maximum margin
Markov networks (M^{3}N) as a special case, and leads to a model
similar to an *L*_{1}-regularized M^{3}N that is simultaneously primal
and dual sparse, or other new types of Markov networks. 4) It
applies a modular learning algorithm that combines existing
variational inference techniques and convex-optimization based
M^{3}N solvers as subroutines.
Essentially, MEDN can be understood as a jointly maximum
likelihood and maximum margin estimate of Markov network. It
represents the first successful attempt to combine maximum entropy
learning (a dual form of maximum likelihood learning) with maximum
margin learning of Markov network for structured input/output
problems; and the basic principle can be generalized to learning
arbitrary graphical models, such as the generative Bayesian networks
or models with structured hidden variables. We discuss a number of
theoretical properties of this approach, and show that empirically it
outperforms a wide array of competing methods for structured
input/output learning on both synthetic and real OCR and web data
extraction data sets.

© JMLR 2009. (edit, beta) |