Journal of Machine Learning Research
Structural Learning of Chain Graphs via Decomposition
http://jmlr.org/papers/v9/ma08a.html
2008
Chain graphs form a broad class of graphical models for describing
conditional independence structures, including both Markov networks
and Bayesian networks as special cases. In this paper, we propose a
computationally feasible method for the structural learning of chain
graphs, based on decomposing the learning problem into a set of
smaller-scale problems on the decomposed subgraphs. The
decomposition requires conditional independencies but does not require
the separators to be complete subgraphs. Algorithms for both skeleton
recovery and complex arrow orientation are presented. Simulations
under a variety of settings demonstrate the competitive performance of
our method, especially when the underlying graph is sparse.
Magic Moments for Structured Output Prediction
http://jmlr.org/papers/v9/ricci08a.html
2008
Most approaches to structured output prediction rely on a hypothesis space
of <i>prediction functions</i> that compute their output by maximizing
a linear <i>scoring function</i>.
In this paper we present two novel learning algorithms for this
hypothesis class, and a statistical analysis of their performance.
The methods rely on efficiently computing the first two moments
of the scoring function over the output space, and using them to
create convex objective functions for training.
We report extensive experimental results for sequence alignment,
named entity recognition, and RNA secondary structure prediction.
Robust Submodular Observation Selection
http://jmlr.org/papers/v9/krause08b.html
2008
In many applications, one has to actively select among a set of
expensive observations before making an informed decision. For
example, in environmental monitoring, we want to select locations to
measure in order to most effectively predict spatial phenomena.
Often, we want to select observations which are robust against a
number of possible objective functions. Examples include minimizing
the maximum posterior variance in Gaussian Process regression,
robust experimental design, and sensor placement for outbreak
detection. In this paper, we present the <i>Submodular
Saturation</i> algorithm, a simple and efficient algorithm with strong
theoretical approximation guarantees for cases where the possible
objective functions exhibit <i>submodularity</i>, an intuitive
diminishing returns property. Moreover, we prove that better
approximation algorithms do not exist unless NP-complete
problems admit efficient algorithms. We show how our algorithm can
be extended to handle complex cost functions (incorporating non-unit
observation cost or communication and path costs). We also show how
the algorithm can be used to near-optimally trade off expected-case
(e.g., the Mean Square Prediction Error in Gaussian Process
regression) and worst-case (e.g., maximum predictive variance)
performance. We show that many important machine learning problems
fit our robust submodular observation selection formalism, and
provide extensive empirical evaluation on several real-world
problems. For Gaussian Process regression, our algorithm compares
favorably with state-of-the-art heuristics described in the
geostatistics literature, while being simpler, faster and providing
theoretical guarantees. For robust experimental design, our
algorithm performs favorably compared to SDP-based algorithms.
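The diminishing-returns property that drives these guarantees is easy to see concretely. Below is a minimal sketch of the classic greedy algorithm on a toy set-cover objective; it illustrates submodular maximization in general, not the authors' Submodular Saturation algorithm, and the sets are made-up examples.

```python
# Greedy maximization of a submodular coverage function f(S) = number of
# elements covered by the chosen sets. Illustrative sketch only; not the
# paper's Submodular Saturation algorithm.

def coverage(selected, sets):
    covered = set()
    for i in selected:
        covered |= sets[i]
    return len(covered)

def greedy_select(sets, k):
    selected = []
    for _ in range(k):
        # pick the set with the largest marginal gain; for monotone
        # submodular objectives this greedy rule is (1 - 1/e)-optimal
        best = max((i for i in range(len(sets)) if i not in selected),
                   key=lambda i: coverage(selected + [i], sets))
        selected.append(best)
    return selected

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}]
print(greedy_select(sets, 2))  # → [2, 0]: the two most complementary sets
```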
Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
http://jmlr.org/papers/v9/hoyle08a.html
2008
Bayesian inference from high-dimensional data involves the
integration over a large number of model parameters. Accurate
evaluation of such high-dimensional integrals raises a unique set
of issues. These issues are illustrated using the exemplar of
model selection for principal component analysis (PCA). A Bayesian
model selection criterion, based on a Laplace approximation to the
model evidence for determining the number of signal principal
components present in a data set, has previously been shown to
perform well on various test data sets. Using simulated data we
show that for <i>d</i>-dimensional data and small sample sizes, <i>N</i>,
the accuracy of this model selection method is strongly affected
by increasing values of <i>d</i>. By taking proper account of the
contribution to the evidence from the large number of
model parameters we show that model selection accuracy is
substantially improved. The accuracy of the improved model evidence is studied
in the asymptotic limit <i>d</i> → ∞ at fixed ratio
α = <i>N</i>/<i>d</i>, with α < 1. In this limit, model selection
based upon the improved model evidence agrees with a frequentist
hypothesis testing approach.
Learning Bounded Treewidth Bayesian Networks
http://jmlr.org/papers/v9/elidan08a.html
2008
With the increased availability of data for complex domains, it is
desirable to learn Bayesian network structures that are sufficiently
expressive for generalization while at the same time allow for
tractable inference. While the method of thin junction trees can, in
principle, be used for this purpose, its fully greedy nature makes it
prone to overfitting, particularly when data is scarce. In this work
we present a novel method for learning Bayesian networks of bounded
treewidth that employs global structure modifications and that is
polynomial both in the size of the graph and the treewidth bound. At
the heart of our method is a dynamic triangulation that we update in a
way that facilitates the addition of chain structures that increase
the bound on the model's treewidth by at most one. We demonstrate the
effectiveness of our "treewidth-friendly" method on several
real-life data sets and show that it is superior to the greedy
approach as soon as the bound on the treewidth is nontrivial.
Importantly, we also show that by making use of global operators, we
are able to achieve better generalization even when learning Bayesian
networks of unbounded treewidth.
JNCC2: The Java Implementation Of Naive Credal Classifier 2
http://jmlr.org/papers/v9/corani08b.html
2008
JNCC2 implements the <i>naive credal classifier 2</i> (NCC2). This is
an extension of naive Bayes to imprecise probabilities that aims to
deliver robust classifications even on small or
incomplete data sets. Robustness is achieved by delivering set-valued
classifications (that is, returning multiple classes) on the
instances for which (i) the learning set is not informative enough to
smooth the effect of choice of the prior density or (ii) the
uncertainty arising from missing data prevents the reliable
indication of a single class. JNCC2 is released under the GNU GPL
license.
An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons
http://jmlr.org/papers/v9/garcia08a.html
2008
In a recently published paper in JMLR, Demšar (2006) recommends
a set of non-parametric statistical tests and procedures which can
be safely used for comparing the performance of classifiers over
multiple data sets. After studying the paper, we find that it
correctly introduces the basic procedures and some of the
most advanced ones for comparing against a control method. However, it
does not treat some advanced topics in depth. Regarding these
topics, we focus on more powerful proposals of statistical
procedures for comparing <i>n</i> × <i>n</i> classifiers. Moreover, we
illustrate an easy way of obtaining adjusted and comparable
<i>p</i>-values in multiple comparison procedures.
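One standard way of obtaining adjusted <i>p</i>-values in multiple comparison procedures is Holm's step-down method. The sketch below is a hedged illustration of that classical procedure only, not of the more powerful proposals the paper advocates.

```python
# Holm's step-down adjustment: sort the p-values, multiply the k-th
# smallest (0-indexed) by (m - k), and enforce monotonicity of the
# adjusted values. Classical procedure, shown for illustration.

def holm_adjust(pvalues):
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        adj = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, adj)  # adjusted values must be non-decreasing
        adjusted[i] = running_max
    return adjusted

adjusted = holm_adjust([0.01, 0.04, 0.03])  # ≈ [0.03, 0.06, 0.06]
```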
Multi-Agent Reinforcement Learning in Common Interest and Fixed Sum Stochastic Games: An Experimental Study
http://jmlr.org/papers/v9/bab08a.html
2008
Multi-Agent Reinforcement Learning (MARL) has received continually
growing attention in the past decade. Many algorithms that vary in
their approaches to the different subtasks of MARL have been
developed. However, the theoretical convergence results for these
algorithms offer little guidance about their practical performance and
supply few insights into the dynamics of the learning process itself. This
work is a comprehensive empirical study conducted on <i>MGS</i>, a
simulation system developed for this purpose. It surveys the
important algorithms in the field, demonstrates the strengths and
weaknesses of the different approaches to MARL through application
of FriendQ, OAL, WoLF, FoeQ, Rmax, and other algorithms to a variety
of fully cooperative and fully competitive domains in self and
heterogeneous play, and supplies an informal analysis of the
resulting learning processes. The results can aid in the design of
new learning algorithms, in matching existing algorithms to specific
tasks, and may guide further research and formal analysis of the
learning processes.
Model Selection for Regression with Continuous Kernel Functions Using the Modulus of Continuity
http://jmlr.org/papers/v9/koo08b.html
2008
This paper presents a new method of model selection for regression
problems using the modulus of continuity. For this purpose, we
derive prediction risk bounds for regression models using the
modulus of continuity, which can be interpreted as a measure of
function complexity. We also present a model selection criterion referred to
as the modulus of continuity information criterion (MCIC) which is
derived from the suggested prediction risk bounds. The suggested
MCIC provides a risk estimate using the modulus of continuity for a
trained regression model (or an estimation function) while other
model selection criteria such as the AIC and BIC use structural
information such as the number of training parameters. As a result,
the suggested MCIC is able to discriminate the performances of
trained regression models, even with the same structure of training
models. To show the effectiveness of the proposed method, we
conducted simulations of function approximation using multilayer
perceptrons (MLPs). The simulations demonstrate that the suggested MCIC
provides a good selection tool for nonlinear regression models, even
with a limited amount of data.
Visualizing Data using t-SNE
http://jmlr.org/papers/v9/vandermaaten08a.html
2008
We present a new technique called "t-SNE" that visualizes
high-dimensional data by giving each datapoint a location in a two or
three-dimensional map. The technique is a variation of Stochastic
Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize,
and produces significantly better visualizations by reducing the
tendency to crowd points together in the center of the map. t-SNE is
better than existing techniques at creating a single map that reveals
structure at many different scales. This is particularly important for
high-dimensional data that lie on several different, but related,
low-dimensional manifolds, such as images of objects from multiple
classes seen from multiple viewpoints. For visualizing the structure
of very large data sets, we show how t-SNE can use random walks on
neighborhood graphs to allow the implicit structure of all of the data
to influence the way in which a subset of the data is displayed. We
illustrate the performance of t-SNE on a wide variety of data sets and
compare it with many other non-parametric visualization techniques,
including Sammon mapping, Isomap, and Locally Linear Embedding. The
visualizations produced by t-SNE are significantly better than those
produced by the other techniques on almost all of the data sets.
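The reduced crowding comes from the heavy-tailed kernel t-SNE uses in the low-dimensional map. A minimal numpy sketch of just that ingredient, the Student-t similarities <i>q<sub>ij</sub></i>, is given below; the input-space Gaussian affinities and the KL-divergence gradient descent are omitted.

```python
# Map-space similarities in t-SNE: q_ij ∝ (1 + ||y_i - y_j||^2)^(-1),
# a Student-t kernel with one degree of freedom. Sketch of this single
# ingredient only.
import numpy as np

def tsne_q(Y):
    # pairwise squared distances between low-dimensional map points
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + d2)      # heavy tails relieve crowding in the map
    np.fill_diagonal(num, 0.0)  # q_ii is defined to be zero
    return num / num.sum()

Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
Q = tsne_q(Y)  # symmetric, zero diagonal, sums to one
```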
Stationary Features and Cat Detection
http://jmlr.org/papers/v9/fleuret08a.html
2008
<p>
Most discriminative techniques for detecting instances from object
categories in still images consist of looping over a partition of a
pose space with dedicated binary classifiers. The efficiency of this
strategy for a complex pose, that is, for fine-grained descriptions, can
be assessed by measuring the effect of sample size and pose resolution
on accuracy and computation. Two conclusions emerge: (1) fragmenting
the training data, which is inevitable in dealing with high in-class
variation, severely reduces accuracy; (2) the computational cost at
high resolution is prohibitive due to visiting a massive pose
partition.
</p>
<p>
To overcome data-fragmentation we propose a novel framework centered
on pose-indexed features which assign a response to a pair consisting
of an image and a pose, and are designed to be stationary: the
probability distribution of the response is always the same if an
object is actually present. Such features allow for efficient,
one-shot learning of pose-specific classifiers. To avoid expensive
scene processing, we arrange these classifiers in a hierarchy based on
nested partitions of the pose; as in previous work on coarse-to-fine
search, this allows for efficient processing.
</p>
<p>
The hierarchy is then "folded" for training: all the classifiers at
each level are derived from one base predictor learned from all the
data. The hierarchy is "unfolded" for testing: parsing a scene amounts
to examining increasingly finer object descriptions only when there is
sufficient evidence for coarser ones. In this way, the detection
results are equivalent to an exhaustive search at high resolution. We
illustrate these ideas by detecting and localizing cats in highly
cluttered greyscale scenes.
</p>
Active Learning of Causal Networks with Intervention Experiments and Optimal Designs
http://jmlr.org/papers/v9/he08a.html
2008
Causal discovery from data is important for various scientific
investigations. Because we cannot distinguish the different directed
acyclic graphs (DAGs) in a Markov equivalence class learned from
observational data, we have to collect further information on causal
structures from experiments with external interventions. In this
paper, we propose an active learning approach for discovering causal
structures in which we first find a Markov equivalence class from
observational data, and then we orient undirected edges in every
chain component via intervention experiments separately. In the
experiments, some variables are manipulated through external
interventions. We discuss two kinds of intervention experiments,
randomized experiments and quasi-experiments. Furthermore, we give two
optimal designs of experiments, a batch-intervention design and a
sequential-intervention design, to minimize the number of
manipulated variables and the set of candidate structures based on
the minimax and the maximum entropy criteria. We show theoretically
that structural learning can be done locally in subgraphs of chain
components without need of checking illegal v-structures and cycles
in the whole network and that a Markov equivalence subclass obtained
after each intervention can still be depicted as a chain graph.
SimpleMKL
http://jmlr.org/papers/v9/rakotomamonjy08a.html
2008
Multiple kernel learning (MKL) aims at simultaneously learning a
kernel and the associated predictor in supervised learning
settings. For the support vector machine, an efficient and general
multiple kernel learning algorithm, based on semi-infinite linear
programming, has been recently proposed. This approach has opened new
perspectives since it makes MKL tractable for large-scale problems, by
iteratively using existing support vector machine code. However, it
turns out that this iterative algorithm needs numerous iterations for
converging towards a reasonable solution. In this paper, we address
the MKL problem through a weighted 2-norm regularization formulation
with an additional constraint on the weights that encourages sparse
kernel combinations. Apart from learning the combination, we solve a
standard SVM optimization problem, where the kernel is defined as a
linear combination of multiple kernels. We propose an algorithm,
named SimpleMKL, for solving this MKL problem and provide a new
insight on MKL algorithms based on mixed-norm regularization by
showing that the two approaches are equivalent. We show how SimpleMKL
can be applied beyond binary classification, for problems like
regression, clustering (one-class classification) or multiclass
classification. Experimental results show that the proposed algorithm
converges rapidly and that its efficiency compares favorably to other
MKL algorithms. Finally, we illustrate the usefulness of MKL for some
regressors based on wavelet kernels and on some model selection
problems related to multiclass classification problems.
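The object SimpleMKL optimizes over can be sketched directly: the effective kernel is a convex combination of base kernels with weights on the simplex. The outer loop that learns the weights by gradient descent on the SVM objective is omitted here, and the toy kernels are illustrative.

```python
# Effective kernel in MKL: K = sum_m d_m K_m with d_m >= 0, sum_m d_m = 1.
# The sparsity-encouraging constraint on d and the SVM solver are omitted.
import numpy as np

def combine_kernels(kernels, d):
    d = np.asarray(d, dtype=float)
    assert np.all(d >= 0) and abs(d.sum() - 1.0) < 1e-9  # simplex constraint
    return sum(w * K for w, K in zip(d, kernels))

X = np.array([[0.0], [1.0], [2.0]])
K_lin = X @ X.T                        # linear base kernel
K_rbf = np.exp(-0.5 * (X - X.T) ** 2)  # Gaussian base kernel
K = combine_kernels([K_lin, K_rbf], [0.3, 0.7])  # still a valid (PSD) kernel
```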
On the Equivalence of Linear Dimensionality-Reducing Transformations
http://jmlr.org/papers/v9/loog08a.html
2008
In this JMLR volume, Ye (2008) demonstrates the essential
equivalence of two sets of solutions to a generalized Fisher criterion
used for linear dimensionality reduction (see Ye, 2005; Loog, 2007).
Here, I point out the basic flaw in this new contribution.
Minimal Nonlinear Distortion Principle for Nonlinear Independent Component Analysis
http://jmlr.org/papers/v9/zhang08b.html
2008
It is well known that solutions to the nonlinear independent
component analysis (ICA) problem are highly non-unique. In this
paper we propose the "minimal nonlinear distortion" (MND)
principle for tackling the ill-posedness of nonlinear ICA
problems. Among all possible solutions, MND prefers the one whose
estimated mixing procedure is as close as possible to linear.
It also helps to avoid local optima in the
solutions. To achieve MND, we exploit a regularization term to
minimize the mean square error between the nonlinear mixing
mapping and the best-fitting linear one. The effect of MND on
the inherent trivial and non-trivial indeterminacies in
nonlinear ICA solutions is investigated. Moreover, we show that
local MND is closely related to the smoothness regularizer
penalizing large curvature, which provides another useful
regularization
condition for nonlinear ICA.
Experiments on synthetic data show the usefulness of the MND
principle for separating various nonlinear mixtures.
Finally, as an application, we use nonlinear ICA with MND to
separate daily returns of a set of stocks in Hong Kong, and the
linear causal relations among them are successfully discovered.
The resulting causal relations give some interesting insights
into the stock market. Such a result cannot be achieved by
linear ICA. Simulation studies also verify that in causal
discovery one sometimes should not ignore the nonlinear
distortion in the data generation procedure, even if it is weak.
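The MND regularization term itself is simple to state: the mean square error between the estimated nonlinear mixing map and its best-fitting linear map. A hedged numpy sketch on a made-up cubic mixing follows; the nonlinear ICA estimation around it is omitted.

```python
# MND regularizer: MSE between a sampled nonlinear mixing map and the
# best-fitting linear map (found here by least squares). The mixing
# below is an arbitrary illustrative nonlinearity.
import numpy as np

rng = np.random.default_rng(0)
S = rng.standard_normal((200, 2))       # source samples
X = S + 0.1 * S ** 3                    # mildly nonlinear mixing of the sources

W, *_ = np.linalg.lstsq(S, X, rcond=None)   # best-fitting linear map
mnd_penalty = np.mean((X - S @ W) ** 2)     # penalizes departure from linearity
```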
On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix
http://jmlr.org/papers/v9/sun08a.html
2008
Binary matrices, and their associated submatrices of 1s, play a
central role in the study of random bipartite graphs and in core data
mining problems such as frequent itemset mining (FIM). Motivated by
these connections, this paper addresses several statistical questions
regarding submatrices of 1s in a random binary matrix with independent
Bernoulli entries. We establish a three-point concentration result,
and a related probability bound, for the size of the largest square
submatrix of 1s in a square Bernoulli matrix, and extend these results
to non-square matrices and submatrices with fixed aspect ratios. We
then consider the noise sensitivity of frequent itemset mining under a
simple binary additive noise model, and show that, even at small noise
levels, large blocks of 1s leave behind fragments of only logarithmic
size. As a result, standard FIM algorithms, which search only for
submatrices of 1s, cannot directly recover such blocks when noise is
present. On the positive side, we show that an error-tolerant
frequent itemset criterion can recover a submatrix of 1s against a
background of 0s plus noise, even when the size of the submatrix of 1s
is very small.
Non-Parametric Modeling of Partially Ranked Data
http://jmlr.org/papers/v9/lebanon08a.html
2008
Statistical models on full and partial rankings of <i>n</i> items are
often of limited practical use for large <i>n</i> due to computational
considerations. We explore the use of non-parametric models for
partially ranked data and derive computationally efficient
procedures for their use for large <i>n</i>. The derivations are
largely possible through combinatorial and algebraic manipulations
based on the lattice of partial rankings. A bias-variance analysis
and an experimental study demonstrate the applicability of the
proposed method.
Model Selection in Kernel Based Regression using the Influence Function
http://jmlr.org/papers/v9/debruyne08a.html
2008
Recent results about the robustness of kernel methods involve the
analysis of influence functions. By definition the influence function
is closely related to leave-one-out criteria. In statistical
learning, the latter is often used to assess the generalization of a
method. In statistics, the influence function is used in a similar way
to analyze the statistical efficiency of a method. Links between both
worlds are explored. The influence function is related to the first
term of a Taylor expansion. Higher order influence functions are
calculated. A recursive relation between these terms is found
characterizing the full Taylor expansion. It is shown how to evaluate
influence functions at a specific sample distribution to obtain an
approximation of the leave-one-out error. A specific implementation is
proposed using an <i>L</i><sub>1</sub> loss in the selection of the hyperparameters
and a Huber loss in the estimation procedure. The parameter in the
Huber loss controlling the degree of robustness is optimized as
well. The resulting procedure gives good results, even when outliers
are present in the data.
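For reference, the Huber loss mentioned above is quadratic for small residuals and linear beyond a threshold, here called beta; this minimal sketch shows the loss function only, not the paper's hyperparameter selection procedure.

```python
# Huber loss: 0.5 r^2 for |r| <= beta, beta (|r| - 0.5 beta) otherwise.
# The linear tails are what limit the influence of outliers.

def huber(r, beta=1.0):
    if abs(r) <= beta:
        return 0.5 * r * r
    return beta * (abs(r) - 0.5 * beta)

print(huber(0.5), huber(3.0))  # → 0.125 2.5
```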
Learning to Select Features using their Properties
http://jmlr.org/papers/v9/krupka08b.html
2008
Feature selection is the task of choosing a small subset of features
that is sufficient to predict the target labels well. Here, instead
of trying to directly determine which features are better, we attempt
to learn the properties of good features. For this purpose we assume
that each feature is represented by a set of properties, referred
to as <i>meta-features</i>. This approach enables prediction of the
quality of features without measuring their value on the training
instances. We use this ability to devise new selection algorithms
that can efficiently search for new good features in the presence
of a huge number of features, and to dramatically reduce the number
of feature measurements needed. We demonstrate our algorithms on a
handwritten digit recognition problem and a visual object category
recognition problem. In addition, we show how this novel viewpoint
enables derivation of better generalization bounds for the joint learning
problem of selection and classification, and how it contributes to
a better understanding of the problem. Specifically, in the context
of object recognition, previous works showed that it is possible to
find one set of features which fits most object categories (aka a
<i>universal dictionary</i>). Here we use our framework to analyze
one such universal dictionary and find that the quality of the features
in it can be predicted accurately from their meta-features.
Probabilistic Characterization of Random Decision Trees
http://jmlr.org/papers/v9/dhurandhar08a.html
2008
In this paper we use the methodology introduced by Dhurandhar and Dobra (2009) for
analyzing the error of classifiers and the model selection measures,
to analyze decision tree algorithms. The methodology consists of
obtaining parametric expressions for the moments of the generalization
error (GE) for the classification model of interest, followed by
plotting these expressions for interpretability. The major challenge
in applying the methodology to decision trees, the main theme of this
work, is customizing the generic expressions for the moments of GE to
this particular classification algorithm. The specific contributions
we make in this paper are: (a) we primarily characterize a subclass of
decision trees, namely random decision trees, (b) we discuss how the
analysis extends to other decision tree algorithms and (c) in order to
extend the analysis to certain model selection measures, we generalize
the relationships between the moments of GE and moments of the model
selection measures given in (Dhurandhar and Dobra, 2009) to randomized classification
algorithms. An empirical comparison of the proposed method with Monte
Carlo estimates and distribution-free bounds obtained using Breiman's formula
demonstrates the advantages of the method in terms of running time and
accuracy. It thus showcases the use of the deployed methodology as an
exploratory tool to study learning algorithms.
Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
http://jmlr.org/papers/v9/warmuth08a.html
2008
<p>
We design an online algorithm for Principal Component
Analysis. In each trial
the current instance is centered and projected into
a probabilistically chosen low dimensional subspace.
The regret of our online algorithm,
that is, the total expected quadratic compression loss of the
online algorithm minus the total quadratic compression loss
of the batch algorithm, is bounded
by a term whose dependence on the dimension of the
instances is only logarithmic.
</p>
<p>
We first develop our methodology in the expert setting of online
learning by giving an algorithm for learning as well
as the best subset of experts of a certain size.
This algorithm is then lifted to the matrix setting where
the subsets of experts correspond to subspaces.
The algorithm represents the uncertainty over the best subspace
as a density matrix whose eigenvalues are bounded.
The running time is <i>O</i>(<i>n</i><sup>2</sup>) per trial, where
<i>n</i> is the dimension of the instances.
</p>
Finding Optimal Bayesian Network Given a Super-Structure
http://jmlr.org/papers/v9/perrier08a.html
2008
Classical approaches to learning Bayesian network structure from
data suffer from high complexity and limited accuracy of
their results. However, a recent empirical study has shown that a
hybrid algorithm substantially improves accuracy and speed: it learns a
skeleton with an independence test (IT) approach and constrains the
directed acyclic graphs (DAGs) considered during the search-and-score
phase. Here, we formalize the structural constraint by
introducing the concept of super-structure <i>S</i>, which is an
undirected graph that restricts the search to networks whose skeleton
is a subgraph of <i>S</i>. We develop a super-structure constrained
optimal search (COS): its time complexity is upper bounded by
<i>O</i>(γ<sub><i>m</i></sub><sup><i>n</i></sup>), where
γ<sub><i>m</i></sub> < 2 depends on the maximal degree <i>m</i> of
<i>S</i>. Empirically, the complexity depends on the average degree
<i>m̃</i>, and sparse structures allow larger graphs to be
handled. Our algorithm is faster than an optimal search by several
orders of magnitude and even finds more accurate results when given a sound
super-structure. Practically, <i>S</i> can be approximated by IT
approaches; the significance level of the tests controls its sparseness,
allowing the trade-off between speed and accuracy to be tuned. For
incomplete super-structures, a greedily post-processed version (COS+)
still significantly outperforms other heuristic searches.
Forecasting Web Page Views: Methods and Observations
http://jmlr.org/papers/v9/li08a.html
2008
Web sites must forecast Web page views in order to plan computer
resource allocation and estimate upcoming revenue and advertising
growth. In this paper, we focus on extracting trends and seasonal
patterns from page view series, two dominant factors in the variation
of such series. We investigate the Holt-Winters procedure and a state
space model for making relatively short-term prediction. It is found
that Web page views exhibit strong impulsive changes occasionally.
The impulses cause large prediction errors long after their
occurrences. A method is developed to identify impulses and to
alleviate their damage on prediction. We also develop a long-range
trend and season extraction method, namely the <i>Elastic Smooth
Season Fitting (ESSF)</i> algorithm, to compute scalable and smooth
yearly seasons. ESSF derives the yearly season by minimizing the
residual sum of squares under smoothness regularization, a quadratic
optimization problem. It is shown that for long-term prediction, ESSF
improves accuracy significantly over other methods that ignore the
yearly seasonality.
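For concreteness, the additive form of the Holt-Winters procedure investigated above can be sketched in a few lines; the initialization and the smoothing constants here are illustrative choices, not the paper's settings.

```python
# Additive Holt-Winters smoothing: recursive updates of level, trend,
# and seasonal components, followed by a one-step-ahead forecast.
# Illustrative sketch; constants alpha, beta, gamma are arbitrary.

def holt_winters(series, m, alpha=0.5, beta=0.3, gamma=0.2):
    # initialize level, trend, and seasonal terms from the first two seasons
    level = sum(series[:m]) / m
    trend = (sum(series[m:2 * m]) - sum(series[:m])) / (m * m)
    season = [series[i] - level for i in range(m)]
    for t, y in enumerate(series):
        prev_level = level
        level = alpha * (y - season[t % m]) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season[t % m] = gamma * (y - level) + (1 - gamma) * season[t % m]
    # one-step-ahead forecast
    return level + trend + season[len(series) % m]

# perfectly periodic toy series with a linear trend: y_t = t + (0, 5, 0, 5, ...)
series = [t + (5 if t % 2 else 0) for t in range(40)]
forecast = holt_winters(series, m=2)  # should approach the true next value, 40
```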
Ranking Individuals by Group Comparisons
http://jmlr.org/papers/v9/huang08a.html
2008
This paper proposes new approaches to rank individuals from their
group comparison results. Many real-world problems are of this
type. For example, ranking players from team comparisons is important
in some sports. In machine learning, a closely related application is
classification using coding matrices. Group comparison results are
usually of two types: binary indicator outcomes (wins/losses) or
measured outcomes (scores). For each type of results, we propose new
models for estimating individuals' abilities, and hence a ranking of
individuals. The estimation is carried out by solving convex
minimization problems, for which we develop easy and efficient
solution procedures. Experiments on real bridge records and
multi-class classification demonstrate the viability of the proposed
models.
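The core idea for measured outcomes can be sketched with least squares: each group comparison constrains the difference between the summed abilities of the two teams. The design matrix and margins below are made-up toy data, and the paper's actual convex models are more refined.

```python
# Recover individual abilities from team score margins by least squares.
# Row r has +1 for members of the first team and -1 for the second team.
import numpy as np

A = np.array([[1.0, 1.0, -1.0, -1.0],     # players {0,1} vs {2,3}
              [1.0, -1.0, 1.0, -1.0],     # players {0,2} vs {1,3}
              [1.0, -1.0, -1.0, 1.0]])    # players {0,3} vs {1,2}
margins = np.array([4.0, 2.0, 0.0])       # observed score differences

abilities, *_ = np.linalg.lstsq(A, margins, rcond=None)
ranking = np.argsort(-abilities)          # strongest individual first
```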
A Moment Bound for Multi-hinge Classifiers
http://jmlr.org/papers/v9/tarigan08a.html
2008
The success of support vector machines in binary classification relies on
the fact that the hinge loss employed in the risk minimization targets the
Bayes rule. Recent research explores some extensions of this large margin
based method to the multicategory case. We show a moment bound for
the so-called multi-hinge loss minimizers based on
two kinds of complexity constraints: entropy with bracketing and empirical
entropy. Obtaining such a result based on the latter is harder than finding
one based on the former. We obtain fast rates of convergence that adapt to the
unknown margin.
HPB: A Model for Handling BN Nodes with High Cardinality Parents
http://jmlr.org/papers/v9/jambeiro08a.html
2008
We replaced the conditional probability tables of Bayesian network
nodes whose parents have high cardinality with a multilevel empirical
hierarchical Bayesian model called hierarchical pattern Bayes (HPB).
The resulting Bayesian networks achieved significant performance
improvements over Bayesian networks with the same structure and
traditional conditional probability tables, over Bayesian networks
with simpler structures like naïve Bayes and tree augmented naïve
Bayes, over Bayesian networks where traditional conditional
probability tables were substituted by noisy-OR gates, default tables,
decision trees, and decision graphs, and over Bayesian networks
constructed after a cardinality reduction preprocessing phase using
the agglomerative information bottleneck method. Our main tests took
place in important fraud detection domains, which are characterized by
the presence of high cardinality attributes and by the existence of
relevant interactions among them. Other tests, over UCI data sets,
show that HPB may have a quite wide applicability.
Gradient Tree Boosting for Training Conditional Random Fields
http://jmlr.org/papers/v9/dietterich08a.html
2008
Conditional random fields (CRFs) provide a flexible and powerful model
for sequence labeling problems. However, existing learning algorithms
are slow, particularly in problems with large numbers of potential
input features and feature combinations. This paper describes a new
algorithm for training CRFs via gradient tree boosting. In tree
boosting, the CRF potential functions are represented as weighted sums
of regression trees, which provide compact representations of feature
interactions. So the algorithm does not explicitly consider the
potentially large parameter space. As a result, gradient tree
boosting scales linearly in the order of the Markov model and in the
order of the feature interactions, rather than exponentially as in
previous algorithms based on iterative scaling and gradient descent.
Gradient tree boosting also makes it possible to use instance
weighting (as in C4.5) and surrogate splitting (as in CART) to handle
missing values. Experimental studies of the effectiveness of these
two methods (as well as standard imputation and indicator feature
methods) show that instance weighting is the best method in most cases
when feature values are missing at random.
Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
http://jmlr.org/papers/v9/george08a.html
2008
We consider the problem of estimating the value of a multiattribute
resource, where the attributes are categorical or discrete in nature
and the number of potential attribute vectors is very large. The
problem arises in approximate dynamic programming when we need to
estimate the value of a multiattribute resource from estimates based
on Monte-Carlo simulation. These problems have been traditionally
solved using aggregation, but choosing the right level of aggregation
requires resolving the classic tradeoff between aggregation error and
sampling error. We propose a method that estimates the value of a
resource at different levels of aggregation simultaneously, and then
uses a weighted combination of the estimates. Using the optimal
weights, which minimize the variance of the estimate while accounting
for correlations between the estimates, is computationally too
expensive for practical applications. We have found that a simple
inverse variance formula (adjusted for bias), which effectively
assumes the estimates are independent, produces near-optimal
estimates. We use the setting of two levels of aggregation to explain
why this approximation works so well.
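The bias-adjusted inverse-variance combination described above can be sketched numerically. The sketch below is an illustrative toy, not the authors' code: the estimates, variances, and squared-bias values are made-up, and the exact form of the bias adjustment is an assumption.

```python
import numpy as np

# Suppose a resource's value is estimated at three aggregation levels.
# Each level provides an estimate v[g], a sampling-variance estimate
# var[g], and a squared-bias estimate bias2[g] relative to the most
# disaggregate level (all values here are illustrative).
v = np.array([10.2, 9.7, 8.9])      # estimates at levels 0 (fine) .. 2 (coarse)
var = np.array([4.0, 1.0, 0.25])    # coarser levels average more samples
bias2 = np.array([0.0, 0.5, 2.0])   # but are more biased

# Inverse-variance weights adjusted for bias: treat the level estimates
# as if independent and weight by 1 / (variance + squared bias).
w = 1.0 / (var + bias2)
w /= w.sum()

combined = float(np.dot(w, v))      # weighted combination of the estimates
```

Because the weights are positive and normalized, the combined estimate is a convex combination of the per-level estimates.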
Approximations for Binary Gaussian Process Classification
http://jmlr.org/papers/v9/nickisch08a.html
2008
We provide a comprehensive overview of many recent algorithms
for approximate inference in Gaussian process models for probabilistic
binary classification. The relationships between several approaches
are elucidated theoretically, and the properties of the different
algorithms are corroborated by experimental results. We examine both
1) the quality of the predictive distributions and 2) the suitability
of the different marginal likelihood approximations for model selection
(selecting hyperparameters) and compare to a gold standard based on
MCMC. Interestingly, some methods produce good predictive distributions
although their marginal likelihood approximations are poor. Strong
conclusions are drawn about the methods: The Expectation Propagation
algorithm is almost always the method of choice unless the computational
budget is very tight. We also extend existing methods in various ways,
and provide unifying code implementing all approaches.
Consistency of Random Forests and Other Averaging Classifiers
http://jmlr.org/papers/v9/biau08a.html
2008
In the last years of his life, Leo Breiman promoted random forests for
use in classification. He suggested using averaging as a means of
obtaining good discrimination rules. The base classifiers used for
averaging are simple and randomized, often based on random samples
from the data. He left a few questions unanswered regarding the
consistency of such rules. In this paper, we give a number of theorems
that establish the universal consistency of averaging rules. We also
show that some popular classifiers, including one suggested by
Breiman, are not universally consistent.
Mixed Membership Stochastic Blockmodels
http://jmlr.org/papers/v9/airoldi08a.html
2008
Consider data consisting of pairwise measurements, such as presence or
absence of links between pairs of objects. These data arise, for
instance, in the analysis of protein interactions and gene regulatory
networks, collections of author-recipient email, and social networks.
Analyzing pairwise measurements with probabilistic models requires
special assumptions, since the usual independence or exchangeability
assumptions no longer hold. Here we introduce a class of variance
allocation models for pairwise measurements: mixed membership
stochastic blockmodels. These models combine global parameters that
instantiate dense patches of connectivity (blockmodel) with local
parameters that instantiate node-specific variability in the
connections (mixed membership). We develop a general variational
inference algorithm for fast approximate posterior inference. We
demonstrate the advantages of mixed membership stochastic blockmodels
with applications to social networks and protein interaction networks.
Complete Identification Methods for the Causal Hierarchy
http://jmlr.org/papers/v9/shpitser08a.html
2008
We consider a hierarchy of queries about causal relationships in
graphical models, where each level in the hierarchy
requires more detailed information than the one below.
The hierarchy consists of three
levels: associative relationships, derived from a joint distribution
over the observable variables;
cause-effect relationships, derived from distributions resulting
from external interventions; and
counterfactuals, derived from distributions that span multiple
"parallel worlds" and result from simultaneous, possibly
conflicting observations and interventions.
We completely characterize cases where a given causal query can be computed
from information lower in the hierarchy, and provide algorithms that
accomplish this computation. Specifically, we show when effects of
interventions can be computed from observational studies, and when
probabilities of counterfactuals can be computed from experimental
studies. We also provide a graphical characterization of those queries which
cannot be computed (by any method) from queries at a lower layer of the
hierarchy.
Manifold Learning: The Price of Normalization
http://jmlr.org/papers/v9/goldberg08a.html
2008
We analyze the performance of a class of manifold-learning
algorithms that find their output by minimizing a quadratic form
under some normalization constraints. This class consists of
Locally Linear Embedding (LLE), Laplacian Eigenmap, Local Tangent
Space Alignment (LTSA), Hessian Eigenmaps (HLLE), and Diffusion
maps. We present and prove conditions on the manifold that are
necessary for the success of the algorithms. Both the finite
sample case and the limit case are analyzed. We show that there
are simple manifolds in which the necessary conditions are
violated, and hence the algorithms cannot recover the underlying
manifolds. Finally, we present numerical results that demonstrate
our claims.
On Relevant Dimensions in Kernel Feature Spaces
http://jmlr.org/papers/v9/braun08a.html
2008
We show that the relevant information of a supervised learning problem
is contained up to negligible error in a finite number of leading
kernel PCA components if the kernel matches the underlying learning
problem in the sense that it can asymptotically represent the function
to be learned and is sufficiently smooth. Thus, kernels not only
transform data sets so that good generalization can be achieved
using only linear discriminant functions, but also do so in a manner
that makes economical use of feature-space
dimensions. In the best case, kernels provide efficient implicit
representations of the data for supervised learning problems.
In practice, we propose an algorithm that enables us to recover the
number of leading kernel PCA components relevant for good
classification. Our algorithm can therefore be applied (1) to analyze
the interplay of data set and kernel in a geometric fashion, (2) to
aid in model selection, and (3) to denoise in feature space in order
to yield better classification results.
LIBLINEAR: A Library for Large Linear Classification
http://jmlr.org/papers/v9/fan08a.html
2008
LIBLINEAR is an open source library for large-scale linear
classification. It supports logistic regression and linear support
vector machines. We provide easy-to-use command-line tools and
library calls for users and developers.
Comprehensive documents are available for both beginners and advanced
users. Experiments demonstrate that LIBLINEAR is very efficient on
large sparse data sets.
Learning Balls of Strings from Edit Corrections
http://jmlr.org/papers/v9/becerra-bonache08a.html
2008
When facing the question of learning languages in realistic settings,
one has to tackle several problems that do not admit simple
solutions. On the one hand, languages are usually defined by complex
grammatical mechanisms for which the learning results are
predominantly negative, as the few existing algorithms cannot really
cope with noise. On the other hand, the learning settings themselves
rely on information that is either too simple (text) or unattainable
(query systems that neither exist in practice nor can be
simulated). We consider simple but sound classes of languages defined
via the widely used edit distance: the balls of strings. We propose to
learn them with the help of a new sort of query, called the
correction query: when a string is submitted to the Oracle, she
either accepts it if it belongs to the target language or proposes a
correction, that is, a string of the language close to the query with
respect to the edit distance. We show that even if the good balls are
not learnable in Angluin's M<small>AT</small> model, they can be learned
from a polynomial number of correction queries. Moreover, experimental
evidence simulating a human expert shows that this algorithm is
resistant to approximate answers.
Classification with a Reject Option using a Hinge Loss
http://jmlr.org/papers/v9/bartlett08a.html
2008
We consider the problem of binary classification where the
classifier can, for a particular cost, choose not to classify an
observation. Just as in the conventional classification problem,
minimization of the sample average of the cost is a difficult
optimization problem. As an alternative, we propose the optimization
of a certain convex loss function φ, analogous to the hinge
loss used in support vector machines (SVMs). Its convexity ensures
that the sample average of this surrogate loss can be efficiently
minimized. We study its statistical properties. We show that
minimizing the expected surrogate loss—the φ-risk—also
minimizes the risk. We also study the rate at which the φ-risk
approaches its minimum value. We show that fast rates are possible
when the conditional probability <i>P</i>(<i>Y</i>=1|<i>X</i>) is unlikely to be
close to certain critical values.
Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
http://jmlr.org/papers/v9/collins08a.html
2008
Log-linear and maximum-margin models are two commonly-used methods in
supervised machine learning, and are frequently used in structured
prediction problems. Efficient learning of parameters in these models
is therefore an important problem, and becomes a key factor when
learning from very large data sets. This paper describes
exponentiated gradient (EG) algorithms for training such models, where
EG updates are applied to the convex dual of either the log-linear or
max-margin objective function; the dual in both the log-linear and
max-margin cases corresponds to minimizing a convex function with
simplex constraints. We study both batch and online variants of the
algorithm, and provide rates of convergence for both cases. In the
max-margin case, <i>O</i>(1/ε) EG updates are required to
reach a given accuracy ε in the dual; in contrast, for
log-linear models only <i>O</i>(log(1/ε)) updates are
required. For both the max-margin and log-linear cases, our bounds
suggest that the online EG algorithm requires a factor of <i>n</i> less
computation to reach a desired accuracy than the batch EG algorithm,
where <i>n</i> is the number of training examples. Our experiments confirm
that the online algorithms are much faster than the batch algorithms
in practice. We describe how the EG updates factor in a convenient
way for structured prediction problems, allowing the algorithms to be
efficiently applied to problems such as sequence learning or natural
language parsing. We perform extensive evaluation of the algorithms,
comparing them to L-BFGS and stochastic gradient descent for
log-linear models, and to SVM-Struct for max-margin
models. The algorithms are applied to a multi-class
problem as well as to a more complex large-scale parsing task. In all
these settings, the EG algorithms presented here outperform the other
methods.
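The core EG step, a multiplicative update followed by renormalization so the dual variables stay on the simplex, can be sketched as follows. This is a generic illustration of the update rule, not the paper's structured-prediction implementation; the quadratic objective and step size are placeholder assumptions.

```python
import numpy as np

def eg_update(u, grad, eta):
    """One exponentiated-gradient step: multiplicative update, then
    renormalization so that u remains a probability distribution."""
    v = u * np.exp(-eta * grad)
    return v / v.sum()

# Minimize a toy convex function f(u) = 0.5 * ||u - t||^2 over the simplex;
# its minimizer over the simplex is t itself, since t lies on the simplex.
t = np.array([0.7, 0.2, 0.1])
u = np.full(3, 1.0 / 3.0)              # start at the uniform distribution
for _ in range(200):
    u = eg_update(u, u - t, eta=0.5)   # grad f(u) = u - t
```

Each iterate stays strictly positive and sums to one, which is exactly the simplex constraint that the dual of the log-linear and max-margin objectives imposes.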
Learning from Multiple Sources
http://jmlr.org/papers/v9/crammer08a.html
2008
We consider the problem of learning accurate models from multiple
sources of "nearby" data. Given distinct samples from multiple data
sources and estimates of the dissimilarities between these sources, we
provide a general theory of which samples should be used to learn
models for each source. This theory is applicable in a broad
decision-theoretic learning framework, and yields general results for
classification and regression. A key component of our approach is the
development of approximate triangle inequalities for expected loss,
which may be of independent interest. We discuss the related problem
of learning parameters of a distribution from multiple data sources.
Finally, we illustrate our theory through a series of synthetic
simulations.
Nearly Uniform Validation Improves Compression-Based Error Bounds
http://jmlr.org/papers/v9/bax08b.html
2008
This paper develops bounds on out-of-sample error rates for support
vector machines (SVMs). The bounds are based on the numbers of support
vectors in the SVMs rather than on VC dimension. The bounds developed
here improve on support vector counting bounds derived using
Littlestone and Warmuth's compression-based bounding technique.
Regularization on Graphs with Function-adapted Diffusion Processes
http://jmlr.org/papers/v9/szlam08a.html
2008
Harmonic analysis and diffusion on discrete data have been shown to
lead to state-of-the-art algorithms for machine learning tasks,
especially in the context of semi-supervised and transductive
learning. The success of these algorithms rests on the assumption
that the function(s) to be studied (learned, interpolated, etc.) are
smooth with respect to the geometry of the data. In this paper we
present a method for modifying the given geometry so the function(s)
to be studied are smoother with respect to the modified geometry,
and thus more amenable to treatment using harmonic analysis methods.
Among the many possible applications, we consider the problems of
image denoising and transductive classification. In both settings,
our approach improves on standard diffusion based methods.
Value Function Based Reinforcement Learning in Changing Markovian Environments
http://jmlr.org/papers/v9/csaji08a.html
2008
The paper investigates the possibility of applying value function
based reinforcement learning (RL) methods in cases when the
environment may change over time. First, theorems are presented which
show that the optimal value function of a discounted Markov decision
process (MDP) depends Lipschitz continuously on the immediate-cost
function and the transition-probability function. Dependence on the
discount factor is also analyzed and shown to be
non-Lipschitz. Afterwards, the concept of (ε,δ)-MDPs
is introduced, which is a generalization of MDPs and
ε-MDPs. In this model the environment may change over
time, more precisely, the transition function and the cost function
may vary from time to time, but the changes must be bounded in the
limit. Then, learning algorithms in changing environments are
analyzed. A general relaxed convergence theorem for stochastic
iterative algorithms is presented. We also demonstrate the results
through three classical RL methods: asynchronous value iteration,
Q-learning and temporal difference learning. Finally, some numerical
experiments concerning changing environments are presented.
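The Lipschitz dependence of the optimal value function on the immediate-cost function can be checked numerically: for a discounted MDP, perturbing the cost function by at most δ changes the optimal value function by at most δ/(1−γ) in sup norm. The toy MDP below is an assumption for illustration only, not from the paper.

```python
import numpy as np

def optimal_value(P, c, gamma, iters=2000):
    """Value iteration for a cost-minimizing MDP.
    P[a] is the |S|x|S| transition matrix of action a; c[a] is its cost vector."""
    n = P.shape[1]
    V = np.zeros(n)
    for _ in range(iters):
        V = np.min(c + gamma * (P @ V), axis=0)   # Bellman optimality backup
    return V

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)                 # row-stochastic transitions
c1 = rng.random((n_actions, n_states))
c2 = c1 + rng.uniform(-0.1, 0.1, c1.shape)        # perturbed cost function

V1 = optimal_value(P, c1, gamma)
V2 = optimal_value(P, c2, gamma)
delta = np.abs(c1 - c2).max()
# Lipschitz bound: ||V1 - V2||_inf <= delta / (1 - gamma)
```

The bound follows from the γ-contraction of the Bellman operator; the experiment simply confirms it on a random instance.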
A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
http://jmlr.org/papers/v9/dalalyan08a.html
2008
<p>
The statistical problem of estimating the effective
dimension-reduction (EDR) subspace in the multi-index regression model
with deterministic design and additive noise is considered. A new
procedure for recovering the directions of the EDR subspace is
proposed. Many methods for estimating the EDR subspace perform
principal component analysis on a family of vectors, say
β<sub>1</sub>,...,β<sub><i>L</i></sub>, nearly lying in the EDR
subspace. This is in particular the case for the structure-adaptive
approach proposed by Hristache et al. (2001a). In the present work, we propose to
estimate the projector onto the EDR subspace by the solution to the
optimization problem
</p>
<center>
minimize max<sub><i>l</i>=1,...,<i>L</i></sub>
β<sub><i>l</i></sub><sup>T</sup> (<i>I</i>-A)β<sub><i>l</i></sub>    subject to A ∈ <i>A<sub>m</sub></i>
</center>
<p>
where <i>A<sub>m</sub></i> is the set of all
symmetric matrices with eigenvalues in [0,1] and trace less than or
equal to <i>m</i>, with <i>m</i> being the true structural dimension. Under
mild assumptions, √<i>n</i>-consistency of the proposed procedure is
proved (up to a logarithmic factor) in the case when the structural
dimension is not larger than 4. Moreover, the stochastic error of
the estimator of the projector onto the EDR subspace is shown to
depend on <i>L</i> logarithmically. This enables us to use a large number
of vectors β<sub><i>l</i></sub> for estimating the EDR subspace. The
empirical behavior of the algorithm is studied through numerical
simulations.
</p>
Universal Multi-Task Kernels
http://jmlr.org/papers/v9/caponnetto08a.html
2008
In this paper we are concerned with reproducing kernel Hilbert
spaces <i>H<sub>K</sub></i> of functions from an input space into a Hilbert
space <i>Y</i>, an environment appropriate for multi-task
learning. The reproducing kernel <i>K</i> associated to <i>H<sub>K</sub></i> has
its values as operators on <i>Y</i>. Our primary goal here is to
derive conditions which ensure that the kernel <i>K</i> is universal.
This means that on every compact subset of the input space, every
continuous function with values in <i>Y</i> can be uniformly
approximated by sections of the kernel. We provide various
characterizations of universal kernels and highlight them with
several concrete examples of some practical importance. Our analysis
uses basic principles of functional analysis and especially the
useful notion of vector measures which we describe in sufficient
detail to clarify our results.
Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
http://jmlr.org/papers/v9/zhu08a.html
2008
Existing template-independent web data extraction approaches adopt
highly ineffective decoupled strategies---attempting to do data
record detection and attribute labeling in two separate phases. In
this paper, we propose an integrated web data extraction paradigm
with hierarchical models. The proposed model is called Dynamic
Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural
uncertainty into consideration and define a joint distribution of
both model structure and class labels. The joint distribution is an
exponential family distribution. As a conditional model, DHMRFs
relax the independence assumption as made in directed models. Since
exact inference is intractable, a variational method is developed to
learn the model's parameters and to find the MAP model structure and
label assignments. We apply DHMRFs to a real-world web data
extraction task. Experimental results show that: (1) integrated web
data extraction models can achieve significant improvements on both
record detection and attribute labeling compared to decoupled
models; (2) in diverse web data extraction, DHMRFs can potentially
address the blocky artifact issue from which fixed-structured
hierarchical models suffer.
Aggregation of SVM Classifiers Using Sobolev Spaces
http://jmlr.org/papers/v9/loustau08a.html
2008
<p>
This paper investigates the statistical performance of Support Vector
Machines (SVM) and considers the problem of adaptation to the margin
parameter and to complexity. In particular we provide a classifier
with no tuning parameter. It is a combination of SVM classifiers.
</p>
<p>
Our contribution is two-fold: (1) we propose learning rates for SVM
using Sobolev spaces and build a numerically realizable aggregate that
converges with the same rate; (2) we present practical experiments with this
method of aggregation for SVM using both Sobolev spaces and Gaussian
kernels.
</p>
Learning to Combine Motor Primitives Via Greedy Additive Regression
http://jmlr.org/papers/v9/chhabra08a.html
2008
The computational complexities arising in motor control can be
ameliorated through the use of a library of motor synergies. We
present a new model, referred to as the Greedy Additive Regression
(GAR) model, for learning a library of torque sequences, and for
learning the coefficients of a linear combination of sequences
minimizing a cost function. From the perspective of numerical
optimization, the GAR model is interesting because it creates a
library of "local features"---each sequence in the library is a
solution to a single training task---and learns to combine these
sequences using a local optimization procedure, namely, additive
regression. We speculate that learners with local representational
primitives and local optimization procedures will show good
performance on nonlinear tasks. The GAR model is also interesting
from the perspective of motor control because it outperforms several
competing models. Results using a simulated two-joint arm suggest
that the GAR model consistently shows excellent performance in the
sense that it rapidly learns to perform novel, complex motor tasks.
Moreover, its library is overcomplete and sparse, meaning that only
a small fraction of the stored torque sequences are used when
learning a new movement. The library is also robust in the sense
that, after an initial training period, nearly all novel movements
can be learned as additive combinations of sequences in the library,
and in the sense that it shows good generalization when an arm's
dynamics are altered between training and test conditions, such as
when a payload is added to the arm. Lastly, the GAR model works well
regardless of whether motor tasks are specified in joint space or
Cartesian space. We conclude that learning techniques using local
primitives and optimization procedures are viable and potentially
important methods for motor control and possibly other domains, and
that these techniques deserve further examination by the artificial
intelligence and cognitive science communities.
Incremental Identification of Qualitative Models of Biological Systems using Inductive Logic Programming
http://jmlr.org/papers/v9/srinivasan08a.html
2008
The use of computational models is
increasingly expected to play an important role in predicting the
behaviour of biological systems.
Models are being sought at different scales of biological
organisation, namely sub-cellular,
cellular, tissue, organ, organism and ecosystem, with a view to
identifying how different components are connected together, how they
are controlled and how they behave when functioning as a system. Except
for very simple biological processes, system identification from
first principles can be extremely difficult. This has brought into
focus automated techniques for constructing
models using data of system behaviour. Such techniques face three
principal issues: (1) The model representation language must be rich enough
to capture system behaviour; (2) The system identification technique
must be powerful enough to identify substantially complex models; and (3)
There may not be sufficient data to obtain both the model's structure
and precise estimates of all of its parameters. In this paper, we
address these issues
in the following ways: (1) Models are represented in an expressive
subset of first-order logic. Specifically, they are expressed as
logic programs; (2) System identification is done using techniques
developed in Inductive Logic Programming (ILP). This allows the
identification of first-order logic models from data. Specifically, we
employ an incremental approach in which increasingly complex models are
constructed from simpler ones using snapshots of system behaviour; and
(3) We restrict ourselves to "qualitative" models. These are
non-parametric: thus, usually less data are required than for
identifying parametric quantitative models. A further advantage
is that the data need not be precise numerical observations
(instead, they are abstractions like positive, negative, zero,
increasing, decreasing and so on).
We describe incremental construction
of qualitative models
using a simple physical system and demonstrate its application
to identification of models at four scales of biological
organisation, namely: (a) a predator-prey model at the ecosystem level;
(b) a model for the human lung at the organ level; (c) a model for regulation
of glucose by insulin in the human body at the extra-cellular
level; and (d) a model for
the glycolysis metabolic pathway at the cellular level.
Causal Reasoning with Ancestral Graphs
http://jmlr.org/papers/v9/zhang08a.html
2008
Causal reasoning is primarily concerned with what would
happen to a system under external interventions. In particular, we
are often interested in predicting the probability distribution of
some random variables that would result if some other variables
were <i>forced</i> to take certain values. One prominent approach
to tackling this problem is based on causal Bayesian networks,
using directed acyclic graphs as <i>causal</i> diagrams to relate
post-intervention probabilities to pre-intervention probabilities
that are estimable from observational data. However, such causal
diagrams are seldom fully testable given observational data. In
consequence, many causal discovery algorithms based on data-mining
can only output an equivalence class of causal diagrams (rather
than a single one). This paper is concerned with causal reasoning
given an equivalence class of causal diagrams, represented by a
(partial) <i>ancestral graph</i>. We present two main results. The
first result extends Pearl (1995)'s celebrated <i>do-calculus</i>
to the context of ancestral graphs. In the second result, we focus
on a key component of Pearl's calculus---the property of
<i>invariance under interventions</i>, and give stronger graphical
conditions for this property than those implied by the first
result. The second result also improves the earlier, similar
results due to Spirtes et al. (1993).
Online Learning of Complex Prediction Problems Using Simultaneous Projections
http://jmlr.org/papers/v9/amit08a.html
2008
We describe and analyze an algorithmic framework for online classification
where each online trial consists of <i>multiple</i> prediction tasks that are
tied together. We tackle the problem of updating the online predictor by
defining a projection problem in which each prediction task corresponds to a
single linear constraint. These constraints are tied together through a single
slack parameter. We then introduce a general method for approximately solving
the problem by projecting <i>simultaneously</i> and independently on each
constraint which corresponds to a prediction sub-problem, and then averaging
the individual solutions. We show that this approach constitutes a feasible,
albeit not necessarily optimal, solution of the original projection problem.
We derive concrete simultaneous projection schemes and analyze them in the mistake
bound model. We demonstrate the power of the proposed algorithm in experiments
with synthetic data and with multiclass text categorization tasks.
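The simultaneous-projection idea can be sketched for linear constraints of the form <i>y<sub>i</sub></i>⟨<i>w</i>, <i>x<sub>i</sub></i>⟩ ≥ 1: project the current predictor onto each constraint's halfspace independently, then average the individual solutions. This is a schematic illustration only; the paper's actual update also involves the shared slack parameter.

```python
import numpy as np

def project_halfspace(w, x, y):
    """Euclidean projection of w onto the halfspace {v : y * <v, x> >= 1}."""
    margin = y * np.dot(w, x)
    if margin >= 1.0:
        return w                                   # already feasible
    return w + (1.0 - margin) / np.dot(x, x) * y * x

rng = np.random.default_rng(1)
w = np.zeros(4)
X = rng.normal(size=(3, 4))                        # three prediction sub-problems
y = np.array([1.0, -1.0, 1.0])

# Project simultaneously and independently on each constraint, then average.
projections = [project_halfspace(w, X[i], y[i]) for i in range(3)]
w_new = np.mean(projections, axis=0)
```

Each individual projection satisfies its own constraint exactly; the average is a feasible (though not necessarily optimal) solution of the relaxed projection problem, mirroring the feasibility guarantee stated in the abstract.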
Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
http://jmlr.org/papers/v9/chang08a.html
2008
Linear support vector machines (SVM) are useful for classifying
large-scale sparse data. Problems with sparse features are common
in applications such as document classification and natural language
processing. In this paper, we propose a novel coordinate descent
algorithm for training linear SVM with the L2-loss function.
At each step, the proposed method minimizes a one-variable sub-problem
while fixing other variables. The sub-problem is solved by Newton
steps with a line search technique. The procedure converges globally
at a linear rate.
As each sub-problem involves only values of a corresponding feature, the
proposed approach is
suitable when accessing a feature is
more convenient than accessing an instance.
Experiments show that our method is more
efficient and stable than state-of-the-art methods such as Pegasos
and TRON.
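A stripped-down sketch of coordinate descent for the L2-loss SVM objective (1/2)‖<i>w</i>‖<sup>2</sup> + <i>C</i> Σ<sub><i>i</i></sub> max(0, 1 − <i>y<sub>i</sub></i><i>w</i><sup>T</sup><i>x<sub>i</sub></i>)<sup>2</sup> follows. The one-variable Newton step uses the generalized second derivative and a simple halving line search; this illustrates the idea only and is not the paper's optimized implementation (which, e.g., exploits feature-wise sparse storage).

```python
import numpy as np

def l2svm_objective(w, X, y, C):
    loss = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * np.dot(w, w) + C * np.sum(loss ** 2)

def coordinate_descent(X, y, C=1.0, sweeps=50):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(sweeps):
        for j in range(d):
            margin = y * (X @ w)
            active = margin < 1.0                  # instances with nonzero loss
            # First and (generalized) second derivative w.r.t. w_j
            g = w[j] - 2.0 * C * np.sum(
                y[active] * X[active, j] * (1.0 - margin[active]))
            h = 1.0 + 2.0 * C * np.sum(X[active, j] ** 2)
            step = -g / h
            # Halving line search: accept once the objective does not increase.
            f_old = l2svm_objective(w, X, y, C)
            for _ in range(20):
                w_try = w.copy()
                w_try[j] += step
                if l2svm_objective(w_try, X, y, C) <= f_old:
                    w = w_try
                    break
                step *= 0.5
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = np.sign(X @ true_w)                            # linearly separable labels
w = coordinate_descent(X, y)
```

Each inner step touches only column <i>j</i> of the data, which is why the method suits settings where accessing a feature is cheaper than accessing an instance.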
A Bahadur Representation of the Linear Support Vector Machine
http://jmlr.org/papers/v9/koo08a.html
2008
The support vector machine has been successful in a variety of
applications. Also on the theoretical front, statistical properties
of the support vector machine have been studied quite extensively
with a particular attention to its Bayes risk consistency under some
conditions. In this paper, we study somewhat basic statistical
properties of the support vector machine yet to be investigated,
namely the asymptotic behavior of the coefficients of the linear
support vector machine. A Bahadur type representation of the
coefficients is established under appropriate conditions, and their
asymptotic normality and statistical variability are derived on the
basis of the representation. These asymptotic results not only
further our understanding of the support vector machine, but can
also be useful for related statistical inferences.
Using Markov Blankets for Causal Structure Learning
http://jmlr.org/papers/v9/pellet08a.html
2008
We show how a generic feature-selection algorithm returning strongly
relevant variables can be turned into a causal structure-learning
algorithm. We prove this under the Faithfulness assumption for the
data distribution. In a causal graph, the strongly relevant variables
for a node <i>X</i> are its parents, children, and children's parents (or
spouses), also known as the Markov blanket of <i>X</i>. Identifying the
spouses leads to the detection of the V-structure patterns and thus to
causal orientations. Repeating the task for all variables yields a
valid partially oriented causal graph. We first show an efficient way
to identify the spouse links. We then perform several experiments in
the continuous domain using the Recursive Feature Elimination
feature-selection algorithm with Support Vector Regression and
empirically verify the intuition of this direct (but computationally
expensive) approach. Within the same framework, we then devise a fast
and consistent algorithm, Total Conditioning (TC), and a variant,
TC<sub>bw</sub>, with an explicit backward feature-selection heuristics, for
Gaussian data. After running a series of comparative experiments on
five artificial networks, we argue that Markov blanket algorithms such
as TC/TC<sub>bw</sub> or Grow-Shrink scale better than the reference PC
algorithm and provide higher structural accuracy.
Optimal Solutions for Sparse Principal Component Analysis
http://jmlr.org/papers/v9/aspremont08a.html
2008
Given a sample covariance matrix, we examine the problem of maximizing
the variance explained by a linear combination of the input variables
while constraining the number of nonzero coefficients in this
combination. This is known as sparse principal component analysis and
has a wide array of applications in machine learning and
engineering. We formulate a new semidefinite relaxation to this
problem and derive a greedy algorithm that computes a <i>full set</i>
of good solutions for all target numbers of nonzero coefficients,
with total complexity <i>O</i>(<i>n</i><sup>3</sup>), where <i>n</i>
is the number of variables. We then use the same relaxation to derive
sufficient conditions for global optimality of a solution, which can
be tested in <i>O</i>(<i>n</i><sup>3</sup>) per pattern. We discuss
applications in subset selection and sparse recovery and show on
artificial examples and biological data that our algorithm does
provide globally optimal solutions in many cases.
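The variance-with-cardinality objective can be illustrated with a naive greedy forward selection on the support set; this is a simple baseline for the same problem, not the paper's <i>O</i>(<i>n</i><sup>3</sup>) algorithm, and all variable names are illustrative:

```python
import numpy as np

def greedy_sparse_pca(S, k):
    """Greedy baseline for sparse PCA: grow a support set, at each step
    adding the variable that most increases the largest eigenvalue of
    the covariance submatrix restricted to the support."""
    n = S.shape[0]
    support = []
    for _ in range(k):
        best_var, best_val = None, -np.inf
        for j in range(n):
            if j in support:
                continue
            idx = support + [j]
            # largest eigenvalue = max variance explained on this support
            val = np.linalg.eigvalsh(S[np.ix_(idx, idx)])[-1]
            if val > best_val:
                best_var, best_val = j, val
        support.append(best_var)
    # sparse loading vector: top eigenvector on the selected support
    idx = sorted(support)
    vals, vecs = np.linalg.eigh(S[np.ix_(idx, idx)])
    x = np.zeros(n)
    x[idx] = vecs[:, -1]
    return x, vals[-1]

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8))
S = A.T @ A / 50                      # sample covariance matrix
x, var = greedy_sparse_pca(S, 3)
print(np.count_nonzero(x), round(var, 3))
```

The returned loading vector has exactly `k` nonzero coefficients, and the explained variance can never exceed the unconstrained leading eigenvalue of `S`.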
Maximal Causes for Non-linear Component Extraction
http://jmlr.org/papers/v9/luecke08a.html
2008
We study a generative model in which hidden causes combine
competitively to produce observations. Multiple active causes combine
to determine the value of an observed variable through a max
function, where algorithms such as sparse coding,
independent component analysis, or non-negative matrix factorization
would use a sum. This max rule can represent a more realistic
model of non-linear interaction between basic components in many
settings, including acoustic and image data.
While exact maximum-likelihood learning of the parameters of this
model proves to be intractable, we show that efficient
approximations to expectation-maximization (EM) can be found in the
case of sparsely active hidden causes. One of these approximations
can be formulated as a neural network model with a generalized softmax
activation function and Hebbian learning.
Thus, we show that learning in recent softmax-like neural networks may
be interpreted as approximate maximization of a data likelihood.
We use the bars benchmark test to numerically verify our analytical
results and to demonstrate the competitiveness of the resulting
algorithms.
Finally, we show results of learning model parameters to fit acoustic
and visual data sets in which max-like component combinations arise
naturally.
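The max rule versus the linear sum can be sketched in a few lines; the weight matrix and activation probability below are illustrative, not fitted parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
H, D = 4, 6                             # hidden causes, observed dimensions
W = rng.uniform(0.0, 1.0, size=(H, D))  # each cause's contribution per dimension
s = rng.random(H) < 0.5                 # a random sparse binary activation

# max rule: each observed value is the largest active contribution ...
y_max = (s[:, None] * W).max(axis=0)
# ... where sparse coding, ICA, or NMF would superimpose them linearly
y_sum = (s[:, None] * W).sum(axis=0)
print(y_max, y_sum)
```

With non-negative contributions the max-combined observation is always dominated by the linear superposition, which is one way to see that the max rule models occlusion-like rather than additive interactions.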
Consistency of the Group Lasso and Multiple Kernel Learning
http://jmlr.org/papers/v9/bach08b.html
2008
We consider the least-square regression problem with regularization by
a block <i>l</i><sub>1</sub>-norm, that is, a sum of Euclidean norms over spaces
of dimensions larger than one. This problem, referred to as the group
Lasso, extends the usual regularization by the <i>l</i><sub>1</sub>-norm where all
spaces have dimension one, where it is commonly referred to as the
Lasso. In this paper, we study the asymptotic group selection
consistency of the group Lasso. We derive necessary and sufficient
conditions for the consistency of group Lasso under practical
assumptions, such as model misspecification. When the linear
predictors and Euclidean norms are replaced by functions and
reproducing kernel Hilbert norms, the problem is usually referred to
as multiple kernel learning and is commonly used for learning from
heterogeneous data sources and for non-linear variable
selection. Using tools from functional analysis, and in particular
covariance operators, we extend the consistency results to this
infinite dimensional case and also propose an adaptive scheme to
obtain a consistent model estimate, even when the necessary condition
required for the non-adaptive scheme is not satisfied.
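The block <i>l</i><sub>1</sub>-norm penalty is most easily seen through its proximal operator, group-wise soft-thresholding, which is what makes whole groups vanish at once; a minimal sketch with an illustrative group structure:

```python
import numpy as np

def group_soft_threshold(w, groups, lam):
    """Proximal operator of the block l1-norm lam * sum_g ||w_g||_2:
    each group's coefficient vector is shrunk toward zero, and set
    exactly to zero when its Euclidean norm falls below lam."""
    out = np.zeros_like(w, dtype=float)
    for g in groups:
        norm = np.linalg.norm(w[g])
        if norm > lam:
            out[g] = (1.0 - lam / norm) * w[g]
    return out

w = np.array([3.0, 4.0, 0.1, -0.1])
groups = [[0, 1], [2, 3]]             # two groups of dimension two
print(group_soft_threshold(w, groups, 1.0))
```

The first group (norm 5) is shrunk but kept; the second (norm about 0.14) is eliminated entirely, which is the group-selection behaviour whose consistency the abstract studies.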
Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
http://jmlr.org/papers/v9/seeger08b.html
2008
<p>
We propose a highly efficient framework for penalized likelihood kernel
methods applied to multi-class models with a large, structured set of classes.
As opposed to many previous approaches which try to decompose the fitting
problem into many smaller ones, we focus on a Newton optimization of the
complete model, making use of model structure and linear conjugate gradients
in order to approximate Newton search directions. Crucially, our learning
method is based entirely on matrix-vector multiplication primitives with the
kernel matrices and their derivatives, allowing straightforward specialization
to new kernels, and focusing code optimization efforts on these primitives
only.
</p>
<p>
Kernel parameters are learned automatically, by maximizing the cross-validation
log likelihood in a gradient-based way, and predictive probabilities are
estimated. We demonstrate our approach on large scale text classification
tasks with hierarchical structure on thousands of classes, achieving
state-of-the-art results in an order of magnitude less time than previous
work.
</p>
<p>
Parts of this work appeared in the conference paper Seeger (2007).
</p>
A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters
http://jmlr.org/papers/v9/jorgensen08a.html
2008
Statistical spam filters are known to be vulnerable to adversarial
attacks. One of the more common adversarial attacks, known as
the <i>good word attack</i>, thwarts spam filters by appending to
spam messages sets of "good" words, which are words that are common
in legitimate email but rare in spam. We present a counterattack
strategy that attempts to differentiate spam from
legitimate email in the input space by transforming each email
into a bag of multiple segments, and subsequently applying
multiple instance logistic regression on the bags. We
treat each segment in the bag as an instance. An email
is classified as spam if at least one instance in the corresponding
bag is spam, and as legitimate if all the instances
in it are legitimate. We show that a classifier using our
multiple instance counterattack strategy is more robust
to good word attacks than its single instance counterpart
and other single instance learners commonly used in the spam filtering domain.
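The bag-level decision rule is simple to state in code. The segmenter and the instance-level test below are hypothetical stand-ins for illustration, not the paper's multiple instance logistic regression:

```python
def classify_bag(segments, instance_is_spam):
    """Multiple-instance rule: an email (a bag of segments) is spam if at
    least one segment looks like spam, and legitimate only if every
    segment looks legitimate."""
    return any(instance_is_spam(seg) for seg in segments)

# hypothetical instance-level test: flag a segment whose spam-word ratio is high
SPAM_WORDS = {"viagra", "winner", "free"}
def looks_spammy(segment):
    words = segment.lower().split()
    return sum(w in SPAM_WORDS for w in words) / max(len(words), 1) > 0.2

# a good word attack appends a clean-looking segment to a spam message
email = ["meeting moved to friday afternoon",
         "free viagra winner click now"]
print(classify_bag(email, looks_spammy))  # the spam segment still triggers
```

The appended good words dilute a whole-message score but leave the spammy segment intact, which is why the bag-level rule resists the attack.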
Ranking Categorical Features Using Generalization Properties
http://jmlr.org/papers/v9/sabato08a.html
2008
Feature ranking is a fundamental machine learning task with various
applications, including feature selection and decision tree learning.
We describe and analyze a new feature ranking method that supports
categorical features with a large number of possible values. We show
that existing ranking criteria rank a feature according to the
<i>training</i> error of a predictor based on the feature. This
approach can fail when ranking categorical features with many values.
We propose the Ginger ranking criterion, which estimates the
<i>generalization</i> error of the predictor associated with the Gini
index. We show that for almost all training sets, the Ginger
criterion produces an accurate estimation of the true generalization
error, regardless of the number of values in a categorical feature. We
also address the question of finding the optimal predictor that is
based on a single categorical feature. It is shown that the predictor
associated with the misclassification error criterion has the minimal
expected generalization error. We bound the bias of this predictor
with respect to the generalization error of the Bayes optimal
predictor, and analyze its concentration properties. We demonstrate
the efficiency of our approach for feature selection and for learning
decision trees in a series of experiments with synthetic and natural
data sets.
Learning Similarity with Operator-valued Large-margin Classifiers
http://jmlr.org/papers/v9/maurer08a.html
2008
A method is introduced to learn and represent similarity with linear
operators in kernel induced Hilbert spaces. Transferring error bounds for
vector valued large-margin classifiers to the setting of Hilbert-Schmidt
operators leads to dimension free bounds on a risk functional for linear
representations and motivates a regularized objective functional.
Minimization of this objective is effected by a simple technique of
stochastic gradient descent. The resulting representations are tested on
transfer problems in image processing, involving plane and spatial geometric
invariants, handwritten characters and face recognition.
Consistency of Trace Norm Minimization
http://jmlr.org/papers/v9/bach08a.html
2008
Regularization by the sum of singular values, also referred to as the
<i>trace norm</i>, is a popular technique for estimating low rank
rectangular matrices. In this paper, we extend some of the consistency
results of the Lasso to provide necessary and sufficient conditions
for rank consistency of trace norm minimization with the square loss.
We also provide an adaptive version that is rank consistent even when
the necessary condition for the non-adaptive version is not fulfilled.
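Trace-norm regularized problems are commonly solved by iterating the proximal operator of the trace norm, which is soft-thresholding of the singular values; the sketch below illustrates the regularizer's low-rank effect, not the paper's consistency conditions, and the test matrix is invented:

```python
import numpy as np

def trace_norm_prox(M, lam):
    """Proximal operator of lam * ||.||_* : soft-threshold the singular
    values, shrinking the matrix toward low rank."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(0)
low_rank = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))
M_noisy = low_rank + 0.01 * rng.standard_normal((5, 4))
X = trace_norm_prox(M_noisy, 0.5)
print(np.linalg.matrix_rank(X, tol=1e-8))
```

Small singular values introduced by the noise are cut to exactly zero, so the estimate recovers a low-rank structure, mirroring how the <i>l</i><sub>1</sub>-norm zeroes individual coefficients in the Lasso.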
Hit Miss Networks with Applications to Instance Selection
http://jmlr.org/papers/v9/marchiori08a.html
2008
In supervised learning, a training set consisting of labeled instances
is used by a learning algorithm for generating a model (classifier)
that is subsequently employed for deciding the class label of new
instances (for generalization). Characteristics of the training set,
such as presence of noisy instances and size, influence the learning
algorithm and affect generalization performance. This paper introduces
a new network-based representation of a training set, called hit miss
network (HMN), which provides a compact description of the nearest
neighbor relation over pairs of instances from each pair of classes.
We show that structural properties of HMN's correspond to properties
of training points related to the one nearest neighbor (1-NN) decision
rule, such as being border or central point. This motivates us to use
HMN's for improving the performance of a 1-NN classifier by removing
instances from the training set (instance selection). We introduce
three new HMN-based algorithms for instance selection: HMN-C, which
removes instances without affecting the accuracy of 1-NN on the original
training set; HMN-E, based on a more aggressive storage reduction; and
HMN-EI, which applies HMN-E iteratively. Their performance is assessed
on 22 data sets with different characteristics, such as input
dimension, cardinality, class balance, number of classes, noise
content, and presence of redundant variables. Results of experiments
on these data sets show that accuracy of 1-NN classifier increases
significantly when HMN-EI is applied. Comparison with
state-of-the-art editing algorithms for instance selection on these
data sets indicates that HMN-EI achieves the best generalization
performance, with no significant difference in storage
requirements. In general, these
results indicate that HMN's provide a powerful graph-based
representation of a training set, which can be successfully applied
for performing noise and redundancy reduction in instance-based
learning.
Shark
http://jmlr.org/papers/v9/igel08a.html
2008
SHARK is an object-oriented library for the design of adaptive
systems. It comprises methods for single- and multi-objective
optimization (e.g., evolutionary and gradient-based algorithms) as
well as kernel-based methods, neural networks, and other machine
learning techniques.
Search for Additive Nonlinear Time Series Causal Models
http://jmlr.org/papers/v9/chu08a.html
2008
Pointwise consistent, feasible procedures for estimating
contemporaneous linear causal structure from time series data have
been developed using multiple conditional independence tests, but no
such procedures are available for non-linear systems. We describe a
feasible procedure for learning a class of non-linear time series
structures, which we call additive non-linear time series. We show
that for data generated from stationary models of this type, two
classes of conditional independence relations among time series
variables and their lags can be tested efficiently and consistently
using tests based on additive model regression. Combining results of
statistical tests for these two classes of conditional independence
relations and the temporal structure of time series data, a new
consistent model specification procedure is able to extract relatively
detailed causal information. We investigate the finite sample behavior
of the procedure through simulation, and illustrate the application of
this method through analysis of the possible causal connections among
four ocean indices. Several variants of the procedure are also
discussed.
Accelerated Neural Evolution through Cooperatively Coevolved Synapses
http://jmlr.org/papers/v9/gomez08a.html
2008
Many complex control problems require sophisticated solutions that are not
amenable to traditional controller design. Not only is it difficult
to model real world systems, but often it is unclear what kind of
behavior is required to solve the task. Reinforcement learning (RL)
approaches have made progress by using direct interaction with
the task environment, but have so far not scaled well to large state
spaces and environments that are not fully observable. In recent
years, neuroevolution, the artificial evolution of neural networks,
has had remarkable success in tasks that exhibit these two properties.
In this paper, we compare a neuroevolution method
called Cooperative Synapse Neuroevolution (CoSyNE), that uses
cooperative coevolution at the level of individual synaptic weights,
to a broad range of reinforcement learning
algorithms on very difficult versions of the pole balancing problem
that involve large (continuous) state spaces and
hidden state. CoSyNE is shown to be significantly more efficient and powerful
than the other methods on these tasks.
Bouligand Derivatives and Robustness of Support Vector Machines for Regression
http://jmlr.org/papers/v9/christmann08a.html
2008
We investigate robustness properties for a broad
class of support vector machines with non-smooth loss functions.
These kernel methods are inspired by convex risk minimization in
infinite dimensional Hilbert spaces. Leading examples are the
support vector machine based on the ε-insensitive loss function,
and kernel based quantile regression based on the pinball loss
function. Firstly, we propose with the Bouligand influence function
(BIF) a modification of F.R. Hampel's influence function. The BIF
has the advantage of being positively homogeneous, which is in general
not true for Hampel's influence function. Secondly, we show that
many support vector machines based on a Lipschitz continuous loss
function and a bounded kernel have a bounded BIF and are thus robust
in the sense of robust statistics based on influence functions.
Graphical Methods for Efficient Likelihood Inference in Gaussian Covariance Models
http://jmlr.org/papers/v9/drton08a.html
2008
In graphical modelling, a bi-directed graph encodes marginal
independences among random variables that are identified with the
vertices of the graph. We show how to transform a bi-directed graph
into a maximal ancestral graph that (i) represents the same
independence structure as the original bi-directed graph, and (ii)
minimizes the number of arrowheads among all ancestral graphs
satisfying (i). Here the number of arrowheads of an ancestral graph is
the number of directed edges plus twice the number of bi-directed
edges. In Gaussian models, this construction can be used for more
efficient iterative maximization of the likelihood function and to
determine when maximum likelihood estimates are equal to empirical
counterparts.
An Error Bound Based on a Worst Likely Assignment
http://jmlr.org/papers/v9/bax08a.html
2008
This paper introduces a new PAC transductive error bound for
classification. The method uses information from the training examples
and inputs of working examples to develop a set of likely assignments
to outputs of the working examples. A likely assignment with maximum
error determines the bound. The method is very effective for small
data sets.
Finite-Time Bounds for Fitted Value Iteration
http://jmlr.org/papers/v9/munos08a.html
2008
In this paper we develop a theoretical analysis of the performance of
sampling-based fitted value iteration (FVI) to solve infinite
state-space, discounted-reward Markovian decision processes (MDPs)
under the assumption that a generative model of the environment is
available. Our main results come in the form of finite-time bounds on
the performance of two versions of sampling-based FVI. The
convergence rate results obtained allow us to show that both versions
of FVI are well-behaved in the sense that by using a sufficiently
large number of samples for a large class of MDPs, arbitrarily good
performance can be achieved with high probability. An important
feature of our proof technique is that it permits the study of
weighted <i>L<sup>p</sup></i>-norm performance bounds. As a result, our technique
applies to a large class of function-approximation methods (e.g.,
neural networks, adaptive regression trees, kernel machines, locally
weighted learning), and our bounds scale well with the effective
horizon of the MDP. The bounds show a dependence on the stochastic
stability properties of the MDP: they scale with the
discounted-average concentrability of the future-state
distributions. They also depend on a new measure of the approximation
power of the function space, the inherent Bellman residual, which
reflects how well the function space is "aligned" with the dynamics
and rewards of the MDP. The conditions of the main result, as well as
the concepts introduced in the analysis, are extensively discussed and
compared to previous theoretical results. Numerical experiments are
used to substantiate the theoretical findings.
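A minimal sampling-based FVI loop under the generative-model assumption might look as follows; the toy MDP, the cubic features, and the sample sizes are invented for illustration and are far simpler than the function classes the bounds cover:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n_grid, n_samples, n_iters = 0.9, 30, 10, 40
xs = np.linspace(0.0, 1.0, n_grid)                  # states where we fit
phi = lambda x: np.array([1.0, x, x * x, x ** 3])   # cubic features

def step(x, a):
    # hypothetical generative model: action 0 drifts left, action 1 right;
    # the reward favours states near 0
    nxt = float(np.clip(x + (0.1 if a else -0.1)
                        + 0.02 * rng.standard_normal(), 0.0, 1.0))
    return nxt, 1.0 - nxt

theta = np.zeros(4)                                 # V(x) ~ phi(x) . theta
for _ in range(n_iters):
    targets = []
    for x in xs:
        backups = []
        for a in (0, 1):                            # sampled Bellman backup
            draws = [step(x, a) for _ in range(n_samples)]
            backups.append(np.mean([r + gamma * (phi(nx) @ theta)
                                    for nx, r in draws]))
        targets.append(max(backups))
    # fit the new value function to the sampled backups (least squares)
    theta, *_ = np.linalg.lstsq(np.array([phi(x) for x in xs]),
                                np.array(targets), rcond=None)
print(phi(0.0) @ theta, phi(1.0) @ theta)
```

Each iteration applies an approximate Bellman operator: sampled backups at a finite set of states followed by a regression step, which is exactly the two error sources (sampling and function approximation) that the finite-time bounds quantify.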
Bayesian Inference and Optimal Design for the Sparse Linear Model
http://jmlr.org/papers/v9/seeger08a.html
2008
<p>
The linear model with sparsity-favouring prior on the coefficients has
important applications in many different domains. In machine learning,
most methods to date search for maximum a posteriori sparse solutions and
neglect to represent posterior uncertainties. In this paper, we address
problems of Bayesian optimal design (or experiment planning), for which
accurate estimates of uncertainty are essential. To this end, we employ
expectation propagation approximate inference for the linear model with
Laplace prior, giving new insight into numerical stability properties and
proposing a robust algorithm. We also show how to estimate model
hyperparameters by empirical Bayesian maximisation of the marginal likelihood,
and propose ideas in order to scale up the method to very large
underdetermined problems.
</p>
<p>
We demonstrate the versatility of our framework on the application of
gene regulatory network identification from micro-array expression data,
where both the Laplace prior and the active experimental design approach are
shown to result in significant improvements. We also address the problem of
sparse coding of natural images, and show how our framework can be used
for compressive sensing tasks.
</p>
<p>
Part of this work appeared in Seeger et al. (2007b). The gene network
identification application appears in Steinke et al. (2007).
</p>
Multi-class Discriminant Kernel Learning via Convex Programming
http://jmlr.org/papers/v9/ye08b.html
2008
Regularized kernel discriminant analysis (RKDA) performs linear
discriminant analysis in the feature space via the kernel trick. Its
performance depends on the selection of kernels. In this paper, we
consider the problem of multiple kernel learning (MKL) for RKDA, in
which the optimal kernel matrix is obtained as a linear combination
of pre-specified kernel matrices. We show that the kernel learning
problem in RKDA can be formulated as convex programs. First, we show
that this problem can be formulated as a semidefinite program (SDP).
Based on the equivalence relationship between RKDA and least square
problems in the binary-class case, we propose a convex quadratically
constrained quadratic programming (QCQP) formulation for kernel
learning in RKDA. A semi-infinite linear programming (SILP)
formulation is derived to further improve the efficiency. We extend
these formulations to the multi-class case based on a key result
established in this paper. That is, the multi-class RKDA kernel
learning problem can be decomposed into a set of binary-class kernel
learning problems which are constrained to share a common kernel.
Based on this decomposition property, SDP formulations are proposed
for the multi-class case. Furthermore, it leads naturally to QCQP
and SILP formulations. As the performance of RKDA depends on the
regularization parameter, we show that this parameter can also be
optimized in a joint framework with the kernel. Extensive
experiments have been conducted and analyzed, and connections to
other algorithms are discussed.
Learning Control Knowledge for Forward Search Planning
http://jmlr.org/papers/v9/yoon08a.html
2008
A number of today's state-of-the-art planners are based on forward
state-space search. The impressive performance can be attributed to
progress in computing domain independent heuristics that perform well
across many domains. However, it is easy to find domains where such
heuristics provide poor guidance, leading to planning failure.
Motivated by such failures, the focus of this paper is to investigate
mechanisms for learning domain-specific knowledge to better control
forward search in a given domain. While there has been a large body
of work on inductive learning of control knowledge for AI planning,
there has been little aimed at forward state-space search. One
reason for this may be that it is challenging to specify a knowledge
representation for compactly representing important concepts across a
wide range of domains. One of the main contributions of this work is
to introduce a novel feature space for representing such control
knowledge. The key idea is to define features in terms of information
computed via relaxed plan extraction, which has been a major source of
success for non-learning planners. This gives a new way of leveraging
relaxed planning techniques in the context of learning. Using this
feature space, we describe three forms of control knowledge---reactive
policies (decision list rules and measures of progress) and linear
heuristics---and show how to learn them and incorporate them into
forward state-space search. Our empirical results show that our
approaches are able to surpass state-of-the-art non-learning planners
across a wide range of planning competition domains.
Graphical Models for Structured Classification, with an Application to Interpreting Images of Protein Subcellular Location Patterns
http://jmlr.org/papers/v9/chen08a.html
2008
In structured classification problems, there is a direct conflict
between expressive models and efficient inference: while graphical
models such as Markov random fields or factor graphs can represent
arbitrary dependences among instance labels, the cost of inference via
belief propagation in these models grows rapidly as the graph
structure becomes more complicated. One important source of
complexity in belief propagation is the need to marginalize large
factors to compute messages. This operation takes time exponential in
the number of variables in the factor, and can limit the
expressiveness of the models we can use. In this paper, we study a
new class of potential functions, which we call decomposable
<i>k</i>-way potentials, and provide efficient algorithms for
computing messages from these potentials during belief propagation.
We believe these new potentials provide a good balance between
expressive power and efficient inference in practical structured
classification problems. We discuss three instances of decomposable
potentials: the associative Markov network potential, the nested
junction tree, and a new type of potential which we call the voting
potential. We use these potentials to classify images of protein
subcellular location patterns in groups of cells. Classifying
subcellular location patterns can help us answer many important
questions in computational biology, including questions about how
various treatments affect the synthesis and behavior of proteins and
networks of proteins within a cell. Our new representation and
algorithm lead to substantial improvements in both inference speed and
classification accuracy.
Trust Region Newton Method for Logistic Regression
http://jmlr.org/papers/v9/lin08b.html
2008
Large-scale logistic regression arises in many applications such as
document classification and natural language processing. In this
paper, we apply a trust region Newton method to maximize the
log-likelihood of the logistic regression model. The proposed method
uses only approximate Newton steps in the beginning, but achieves fast
convergence in the end. Experiments show that it is faster than the
commonly used quasi-Newton approach for logistic regression. We also
extend the proposed method to large-scale L2-loss linear support
vector machines (SVM).
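The matrix-free flavour of such a solver can be sketched as Newton steps whose directions come from conjugate gradients on Hessian-vector products; this simplified version omits the trust-region safeguard of the actual method and uses invented toy data:

```python
import numpy as np

def newton_cg_logreg(X, y, C=1.0, newton_iters=10, cg_iters=25):
    """Newton's method for L2-regularized logistic regression
    (min 0.5 ||w||^2 + C * sum_i log(1 + exp(-y_i x_i . w)), y in {-1,+1}),
    with each search direction computed by conjugate gradients using only
    Hessian-vector products, never the explicit Hessian."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(newton_iters):
        z = y * (X @ w)
        sigma = 1.0 / (1.0 + np.exp(-z))
        grad = w - C * (X.T @ (y * (1.0 - sigma)))
        if float(grad @ grad) < 1e-12:
            break                                   # already at the optimum
        dw = sigma * (1.0 - sigma)                  # Hessian = I + C X'DX
        hess_vec = lambda v: v + C * (X.T @ (dw * (X @ v)))
        s, r = np.zeros(d), -grad                   # CG for H s = -grad
        p, rs = r.copy(), float(r @ r)
        for _ in range(cg_iters):
            Hp = hess_vec(p)
            alpha = rs / float(p @ Hp)
            s += alpha * p
            r -= alpha * Hp
            rs_new = float(r @ r)
            if rs_new < 1e-12:
                break
            p, rs = r + (rs_new / rs) * p, rs_new
        w += s
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.sign(X @ rng.standard_normal(5))
w = newton_cg_logreg(X, y)
print(np.mean(np.sign(X @ w) == y))                 # training accuracy
```

Because the inner loop touches the data only through products with `X` and `X.T`, the same skeleton scales to sparse document-term matrices, which is the regime the abstract targets.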
A Library for Locally Weighted Projection Regression
http://jmlr.org/papers/v9/klanke08a.html
2008
In this paper we introduce an improved implementation of locally weighted
projection regression (LWPR), a supervised learning algorithm that is
capable of handling high-dimensional input data. As the key features,
our code supports multi-threading, is available for multiple
platforms, and provides wrappers for several programming languages.
Learning Reliable Classifiers From Small or Incomplete Data Sets: The Naive Credal Classifier 2
http://jmlr.org/papers/v9/corani08a.html
2008
In this paper, the naive credal classifier, which is a
set-valued counterpart of naive Bayes, is extended to a general and
flexible treatment of incomplete data, yielding a new classifier
called <i>naive credal classifier 2</i> (NCC2). The new classifier delivers
classifications that are reliable even in the presence of small sample
sizes and missing values. Extensive empirical evaluations show that,
by issuing set-valued classifications, NCC2 is able to isolate and
properly deal with instances that are hard to classify (on which naive
Bayes accuracy drops considerably), and to perform as well as naive
Bayes on the other instances. The experiments point to a general
problem: they show that with missing values, empirical evaluations may
not reliably estimate the accuracy of a traditional classifier, such
as naive Bayes. This phenomenon adds even more value to the robust
approach to classification implemented by NCC2.
Closed Sets for Labeled Data
http://jmlr.org/papers/v9/garriga08a.html
2008
Closed sets have been proven successful in the context of compacted
data representation for association rule learning. However, their use
is mainly descriptive, dealing only with unlabeled data. This paper
shows that when considering labeled data, closed sets can be adapted
for classification and discrimination purposes by conveniently
contrasting covering properties on positive and negative examples. We
formally prove that these sets characterize the space of relevant
combinations of features for discriminating the target class. In
practice, identifying relevant/irrelevant combinations of features
through closed sets is useful in many applications:
to compact emerging
patterns of typical descriptive mining applications, to reduce the number
of essential rules in classification, and to efficiently learn
subgroup descriptions, as demonstrated in real-life subgroup discovery
experiments on a high dimensional microarray data set.
An Information Criterion for Variable Selection in Support Vector Machines
http://jmlr.org/papers/v9/claeskens08a.html
2008
Support vector machines for classification have the advantage that
the curse of dimensionality is circumvented. It has been shown that
a reduction of the dimension of the input space leads to even better
results. For this purpose, we propose two information criteria which
can be computed directly from the definition of the support vector
machine. We assess the predictive performance of the models selected
by our new criteria and compare them to existing variable selection
techniques in a simulation study. The simulation results show that
the new criteria are competitive in terms of generalization error
rate while being much easier to compute. We arrive at the same
findings for comparison on some real-world benchmark data sets.
Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
http://jmlr.org/papers/v9/jiang08a.html
2008
Support vector machine (SVM) is one of the most popular and
promising classification algorithms. After a classification rule is
constructed via the SVM, it is essential to evaluate its prediction
accuracy. In this paper, we develop procedures for obtaining both
point and interval estimators for the prediction error. Under mild
regularity conditions, we derive the consistency and asymptotic
normality of the prediction error estimators for SVM with
finite-dimensional kernels. A perturbation-resampling procedure is
proposed to obtain interval estimates for the prediction error in
practice. With numerical studies on simulated data and a benchmark
repository, we recommend the use of interval estimates centered at
the cross-validated point estimates for the prediction error.
Further applications of the proposed procedure in model evaluation
and feature selection are illustrated with two examples.
Comments on the Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion
http://jmlr.org/papers/v9/ye08a.html
2008
Loog (2007) provided a complete characterization of the
family of solutions to a generalized <i>Fisher</i> criterion. We show
that this characterization is essentially equivalent to the original
characterization proposed in Ye (2005). The computational
advantage of the original characterization over the new one is
discussed, which justifies its practical use.
Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
http://jmlr.org/papers/v9/banerjee08a.html
2008
We consider the problem of estimating the parameters of a Gaussian or
binary distribution in such a way that the resulting undirected
graphical model is sparse. Our approach is to solve a maximum
likelihood problem with an added <i>l</i><sub>1</sub>-norm penalty term. The
problem as formulated is convex but the memory requirements and
complexity of existing interior point methods are prohibitive for
problems with more than tens of nodes. We present two new algorithms
for solving problems with at least a thousand nodes in the Gaussian
case. Our first algorithm uses block coordinate descent, and can be
interpreted as recursive <i>l</i><sub>1</sub>-norm penalized regression. Our
second algorithm, based on Nesterov's first order method, yields a
complexity estimate with a better dependence on problem size than
existing interior point methods. Using a log determinant relaxation
of the log partition function (Wainwright and Jordan, 2006), we show that these
same algorithms can be used to solve an approximate sparse maximum
likelihood problem for the binary case. We test our algorithms on
synthetic data, as well as on gene expression and senate voting
records data.
A Recursive Method for Structural Learning of Directed Acyclic Graphs
http://jmlr.org/papers/v9/xie08a.html
2008
In this paper, we propose a recursive method for structural learning
of directed acyclic graphs (DAGs), in which a problem of structural
learning for a large DAG is first decomposed into two problems of
structural learning for two small vertex subsets, each of which is
then decomposed recursively into two problems of smaller subsets
until no subset can be decomposed further. In our approach, search
for separators of a pair of variables in a large DAG is localized to
small subsets, and thus the approach can improve the efficiency of
searches and the power of statistical tests for structural learning.
We show how the recent advances in the learning of undirected
graphical models can be employed to facilitate the decomposition.
Simulations are given to demonstrate the performance of the proposed
method.
Theoretical Advantages of Lenient Learners: An Evolutionary Game Theoretic Perspective
http://jmlr.org/papers/v9/panait08a.html
2008
This paper presents the dynamics of multiple learning agents from an
evolutionary game theoretic perspective. We provide replicator
dynamics models for cooperative coevolutionary algorithms and for
traditional multiagent Q-learning, and we extend these differential
equations to account for lenient learners: agents that forgive
possible mismatched teammate actions that resulted in low rewards. We
use these extended formal models to study the convergence guarantees
for these algorithms, and also to visualize the basins of attraction
to optimal and suboptimal solutions in two benchmark coordination
problems. The paper demonstrates that lenience provides learners with
more accurate information about the benefits of performing their
actions, resulting in higher likelihood of convergence to the globally
optimal solution. In addition, the analysis indicates that the choice
of learning algorithm has an insignificant impact on the overall
performance of multiagent learning algorithms; rather, the performance
of these algorithms depends primarily on the level of lenience that
the agents exhibit to one another. Finally, the research herein
supports the strength and generality of evolutionary game theory as a
backbone for multiagent learning.
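To illustrate the effect of lenience on replicator dynamics, here is a small simulation sketch (the penalty game, the Euler discretization, and the closed form for the expected maximum over kappa partner samples are illustrative assumptions; the formula assumes distinct payoffs within each row):

```python
import numpy as np

def lenient_payoffs(A, partner, kappa):
    """Expected payoff of each own action when the learner samples the
    partner's action kappa times and keeps only the highest reward."""
    out = np.zeros(A.shape[0])
    for i in range(A.shape[0]):
        order = np.argsort(A[i])
        vals, probs = A[i][order], partner[order]
        cdf = np.cumsum(probs)
        prev = np.concatenate(([0.0], cdf[:-1]))
        out[i] = np.sum(vals * (cdf ** kappa - prev ** kappa))
    return out

def run_replicator(A, x0, kappa=1, dt=0.02, steps=6000):
    """Euler-discretized replicator dynamics for a symmetric common-payoff
    game; kappa=1 recovers the standard (non-lenient) dynamics."""
    x = x0.copy()
    for _ in range(steps):
        f = lenient_payoffs(A, x, kappa)
        x = x + dt * x * (f - x @ f)
        x = np.clip(x, 1e-12, None)
        x /= x.sum()
    return x

# Common-payoff penalty game: joint action (0, 0) is optimal (payoff 10),
# but miscoordination around it is punished harshly (-20).
A = np.array([[10.0, -20.0], [-20.0, 5.0]])
start = np.array([0.2, 0.8])                 # population biased toward the safe action
x_standard = run_replicator(A, start, kappa=1)
x_lenient = run_replicator(A, start, kappa=10)
```

From this biased start, the standard dynamics converge to the suboptimal action while the lenient dynamics recover the optimal one, in line with the paper's thesis.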
A Tutorial on Conformal Prediction
http://jmlr.org/papers/v9/shafer08a.html
2008
<p>
Conformal prediction uses past experience to determine precise levels
of confidence in new predictions. Given an error probability
ε, together with a method that makes a prediction <i>ŷ</i>
of a label <i>y</i>, it produces a set of labels, typically containing
<i>ŷ</i>, that also contains <i>y</i> with probability 1 – ε.
Conformal prediction can be applied to any method for producing
<i>ŷ</i>: a nearest-neighbor method, a support-vector machine, ridge
regression, etc.
</p>
<p>
Conformal prediction is designed for an on-line setting in which
labels are predicted successively, each one being revealed before the
next is predicted. The most novel and valuable feature of conformal
prediction is that if the successive examples are sampled
independently from the same distribution, then the successive
predictions will be right 1 – ε of the time, even though they
are based on an accumulating data set rather than on independent data
sets.
</p>
<p>
In addition to the model under which successive examples are sampled
independently, other on-line compression models can also use conformal
prediction. The widely used Gaussian linear model is one of these.
</p>
<p>
This tutorial presents a self-contained account of the theory of
conformal prediction and works through several numerical examples. A
more comprehensive treatment of the topic is provided in
<i>Algorithmic Learning in a Random World</i>, by Vladimir Vovk,
Alex Gammerman, and Glenn Shafer (Springer, 2005).
</p>
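The procedure the tutorial describes can be sketched in a few lines for classification with a 1-NN nonconformity measure (this particular score and the toy data are illustrative assumptions):

```python
import numpy as np

def nonconformity(X, y, i):
    """1-NN nonconformity score: distance to the nearest example with the
    same label divided by distance to the nearest example with a
    different label."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                       # exclude the example itself
    return d[y == y[i]].min() / d[y != y[i]].min()

def conformal_set(X, y, x_new, labels, eps):
    """Full conformal prediction set: a candidate label survives if the
    augmented example's nonconformity p-value exceeds eps."""
    keep = []
    for lab in labels:
        Xa, ya = np.vstack([X, x_new]), np.append(y, lab)
        scores = np.array([nonconformity(Xa, ya, i) for i in range(len(ya))])
        if np.mean(scores >= scores[-1]) > eps:   # p-value of the new example
            keep.append(lab)
    return keep

# Two well-separated classes; at eps = 0.1 the prediction set for a point
# deep inside class 0 should be the singleton {0}.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal((0, 0), 0.5, (20, 2)),
               rng.normal((5, 5), 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
pred_set = conformal_set(X, y, np.array([0.2, -0.1]), labels=[0, 1], eps=0.1)
```

The same wrapper works unchanged around any underlying predictor, as the abstract notes; only the nonconformity score changes.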
Generalization from Observed to Unobserved Features by Clustering
http://jmlr.org/papers/v9/krupka08a.html
2008
We argue that when objects are characterized by many attributes, clustering
them on the basis of a <i>random</i> subset of these attributes can
capture information on the unobserved attributes as well. Moreover,
we show that under mild technical conditions, clustering the objects
on the basis of such a random subset performs almost as well as clustering
with the full attribute set. We prove finite sample generalization
theorems for this novel learning scheme that extends analogous results
from the supervised learning setting. We use our framework to analyze
generalization to unobserved features of two well-known clustering
algorithms: <i>k</i>-means and the maximum likelihood multinomial mixture
model. The scheme is demonstrated for collaborative filtering of users
with movie ratings as attributes and document clustering with words
as attributes.
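The claim can be demonstrated directly: cluster the same objects once on all attributes and once on a random subset, and compare the partitions (the plain Lloyd's implementation with farthest-point initialization and the synthetic data are illustrative assumptions):

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Lloyd's algorithm with farthest-point initialization; returns the
    cluster label of every row of X."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    centers = np.array(centers)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def agreement(a, b):
    """Agreement of two 2-way clusterings, up to label permutation."""
    return max(np.mean(a == b), np.mean(a != b))

# 200 objects with 50 attributes; every attribute carries the same
# 2-cluster signal, so a random 20% subset suffices.
rng = np.random.default_rng(2)
truth = rng.integers(0, 2, size=200)
X = np.where(truth[:, None] == 1, 1.0, -1.0) + 0.3 * rng.normal(size=(200, 50))
subset = rng.choice(50, size=10, replace=False)
full_labels = kmeans(X, 2)
subset_labels = kmeans(X[:, subset], 2)
agree = agreement(full_labels, subset_labels)
```

With redundant attributes the two partitions agree almost perfectly, which is the qualitative content of the generalization theorems.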
Algorithms for Sparse Linear Classifiers in the Massive Data Setting
http://jmlr.org/papers/v9/balakrishnan08a.html
2008
Classifiers favoring sparse solutions, such as support vector
machines, relevance vector machines, LASSO-regression based
classifiers, etc., provide competitive methods for
classification problems in high dimensions. However, current
algorithms for training sparse classifiers typically scale quite
unfavorably with respect to the number of training examples. This
paper proposes online and multi-pass algorithms for training
sparse linear classifiers for high dimensional data. These
algorithms have computational complexity and memory requirements
that make learning on massive data sets feasible. The central idea
that makes this possible is a straightforward quadratic
approximation to the likelihood function.
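The abstract does not spell out the update rule; as a rough illustration of the general recipe of an online pass that combines a cheap local approximation of the likelihood with l1 shrinkage (this proximal-gradient sketch is an assumption, not the authors' algorithm):

```python
import numpy as np

def online_sparse_logreg(X, y, lam=0.01, eta=0.1):
    """Single online pass of l1-regularized logistic regression: a gradient
    step on the logistic loss followed by a soft-threshold (proximal) step
    that keeps the weight vector sparse."""
    w = np.zeros(X.shape[1])
    for x, t in zip(X, y):                       # t in {-1, +1}
        g = -t * x / (1.0 + np.exp(t * (w @ x)))   # logistic-loss gradient
        w = w - eta * g
        w = np.sign(w) * np.maximum(np.abs(w) - eta * lam, 0.0)
    return w

# 3 informative features out of 20; memory use is O(dim), independent of n.
rng = np.random.default_rng(3)
X = rng.normal(size=(3000, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -2.0, 2.0]
y = np.sign(X @ w_true)
w = online_sparse_logreg(X, y)
acc = np.mean(np.sign(X @ w) == y)
```

The weights on the 17 irrelevant features stay near zero while the informative ones dominate, which is the kind of behavior sparse online training aims for.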
Support Vector Machinery for Infinite Ensemble Learning
http://jmlr.org/papers/v9/lin08a.html
2008
Ensemble learning algorithms such as boosting can achieve better
performance by averaging over the predictions of some base hypotheses.
Nevertheless, most existing algorithms are limited to combining only a
finite number of hypotheses, and the generated ensemble is usually
sparse. Thus, it is not clear whether we should construct an ensemble
classifier with a larger or even an infinite number of hypotheses. In
addition, constructing an infinite ensemble itself is a challenging
task. In this paper, we formulate an infinite ensemble learning
framework based on the support vector machine (SVM). The framework
can output an infinite and nonsparse ensemble through embedding
infinitely many hypotheses into an SVM kernel. We use the framework
to derive two novel kernels, the stump kernel and the perceptron
kernel. The stump kernel embodies infinitely many decision stumps,
and the perceptron kernel embodies infinitely many perceptrons. We
also show that the Laplacian radial basis function kernel embodies
infinitely many decision trees, and can thus be explained through
infinite ensemble learning. Experimental results show that SVM with
these kernels is superior to boosting with the same base hypothesis
set. In addition, SVM with the stump kernel or the perceptron kernel
performs similarly to SVM with the Gaussian radial basis function
kernel, but enjoys the benefit of faster parameter selection. These
properties make the novel kernels favorable choices in practice.
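The stump kernel has a simple closed form, K(x, x') = Delta - ||x - x'||_1 / 2, with Delta a constant tied to the range of the input domain. A minimal sketch using a dual perceptron as a stand-in for a full SVM solver (the constant Delta, the toy data, and the perceptron substitute are illustrative assumptions):

```python
import numpy as np

def stump_kernel(X1, X2, delta):
    """Stump kernel K(x, x') = delta - ||x - x'||_1 / 2, embodying
    infinitely many decision stumps."""
    return delta - 0.5 * np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=-1)

def kernel_perceptron(K, y, epochs=50):
    """Dual perceptron: a cheap stand-in for the SVM solver, enough to
    exercise the kernel on separable data."""
    alpha = np.zeros(len(y))
    for _ in range(epochs):
        for i in range(len(y)):
            if y[i] * ((alpha * y) @ K[:, i]) <= 0:
                alpha[i] += 1.0
    return alpha

# Two blobs separable by a single decision stump on either coordinate.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal((0, 0), 0.3, (25, 2)),
               rng.normal((3, 3), 0.3, (25, 2))])
y = np.array([-1.0] * 25 + [1.0] * 25)
K = stump_kernel(X, X, delta=10.0)
alpha = kernel_perceptron(K, y)
train_acc = np.mean(np.sign((alpha * y) @ K) == y)
```

Because the data are separable by one stump, they are separable in the stump-kernel feature space, and the dual learner fits them exactly.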
Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies
http://jmlr.org/papers/v9/krause08a.html
2008
When monitoring spatial phenomena, which can often be modeled as
Gaussian processes (GPs), choosing sensor locations is a fundamental
task. There are several common strategies to address this task, for
example, geometry or disk models, placing sensors at the points of
highest entropy (variance) in the GP model, and A-, D-, or E-optimal
design. In this paper, we tackle the combinatorial optimization
problem of maximizing the <i>mutual information</i> between the
chosen locations and the locations which are not selected. We prove
that the problem of finding the configuration that maximizes mutual
information is NP-complete. To address this issue, we describe a
polynomial-time approximation that is within (1-1/<i>e</i>) of the
optimum by exploiting the <i>submodularity</i> of mutual
information. We also show how submodularity can be used to obtain
online bounds, and design branch and bound search procedures. We
then extend our algorithm to exploit lazy evaluations and local
structure in the GP, yielding significant speedups. We also extend
our approach to find placements which are robust against node
failures and uncertainties in the model. These extensions are again
associated with rigorous theoretical approximation guarantees,
exploiting the submodularity of the objective function. We
demonstrate the advantages of our approach towards optimizing mutual
information in a very extensive empirical study on two real-world
data sets.
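The greedy (1-1/e) selection reduces to a simple variance-ratio rule: the mutual-information gain of adding y is (1/2) log of var(y | selected) / var(y | all unselected others), so it suffices to maximize that ratio. A sketch on a 1-D line of candidate locations (the RBF covariance, its lengthscale, and the noise nugget are illustrative assumptions):

```python
import numpy as np

def cond_var(Sigma, i, S):
    """Conditional variance Var(y_i | y_S) under a joint Gaussian."""
    if len(S) == 0:
        return Sigma[i, i]
    sub = Sigma[np.ix_(S, S)]
    return Sigma[i, i] - Sigma[i, S] @ np.linalg.solve(sub, Sigma[S, i])

def greedy_mi_placement(Sigma, k):
    """Greedy near-optimal selection for the mutual-information criterion:
    each step adds the location maximizing var(y | chosen) / var(y | rest)."""
    n = Sigma.shape[0]
    chosen = []
    for _ in range(k):
        best, best_ratio = None, -np.inf
        for y in range(n):
            if y in chosen:
                continue
            rest = [z for z in range(n) if z != y and z not in chosen]
            ratio = cond_var(Sigma, y, chosen) / cond_var(Sigma, y, rest)
            if ratio > best_ratio:
                best, best_ratio = y, ratio
        chosen.append(best)
    return chosen

# 10 candidate locations on a line, RBF covariance plus a small nugget.
pos = np.arange(10.0)
Sigma = (np.exp(-(pos[:, None] - pos[None, :]) ** 2 / (2 * 2.0 ** 2))
         + 0.01 * np.eye(10))
placement = greedy_mi_placement(Sigma, 2)
```

Unlike the entropy criterion, the mutual-information rule avoids wasting the first sensor on the boundary of the domain, and the chosen locations spread out.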
Optimization Techniques for Semi-Supervised Support Vector Machines
http://jmlr.org/papers/v9/chapelle08a.html
2008
Due to its wide applicability, the problem of semi-supervised
classification is attracting increasing attention in machine learning.
Semi-Supervised Support Vector Machines (S<sup>3</sup>VMs) are based on applying
the margin maximization principle to both labeled and unlabeled
examples. Unlike SVMs, their formulation leads to a non-convex
optimization problem. A suite of algorithms has recently been
proposed for solving S<sup>3</sup>VMs. This paper reviews key ideas in this
literature. The performance and behavior of various S<sup>3</sup>VM algorithms
are studied together under a common experimental setting.

Evidence Contrary to the Statistical View of Boosting
http://jmlr.org/papers/v9/mease08a.html
2008
The statistical perspective on boosting algorithms focuses on
optimization, drawing parallels with maximum likelihood estimation
for logistic regression. In this paper we present empirical evidence
that raises questions about this view. Although the statistical
perspective provides a theoretical framework within which it is
possible to derive theorems and create new algorithms in general
contexts, we show that there remain many unanswered important
questions. Furthermore, we provide examples that reveal crucial
flaws in the many practical suggestions and new methods that are
derived from the statistical view. We perform carefully designed
experiments using simple simulation models to illustrate some of
these flaws and their practical consequences.
Active Learning by Spherical Subdivision
http://jmlr.org/papers/v9/henrich08a.html
2008
We introduce a computationally feasible, "constructive" active
learning method for binary classification. The learning algorithm is
initially formulated for separable classification problems, for a
hyperspherical data space with constant data density, and for great
spheres as classifiers. In order to reduce computational complexity,
the version space is restricted to spherical simplices and learning
proceeds by subdividing the edges of maximal length. We show that this
procedure optimally reduces a tight upper bound on the generalization
error. The method is then extended to other separable classification
problems using products of spheres as data spaces and isometries
induced by charts of the sphere. An upper bound is provided for the
probability of disagreement between classifiers (hence the
generalization error) for non-constant data densities on the sphere.
The emphasis of this work lies on providing mathematically exact
performance estimates for active learning strategies.
Discriminative Learning of Max-Sum Classifiers
http://jmlr.org/papers/v9/franc08a.html
2008
The max-sum classifier predicts an <i>n</i>-tuple of labels from an <i>n</i>-tuple of
observable variables by maximizing a sum of quality functions defined over
neighbouring pairs of labels and observable variables.
Predicting labels as MAP assignments of a Markov Random Field is a
particular example of the max-sum classifier.
Learning parameters of the max-sum classifier is a challenging problem
because even computing the response of such a classifier is NP-complete
in general. Estimating parameters using the Maximum Likelihood
approach is feasible only for a subclass of max-sum classifiers with
an acyclic structure of neighbouring pairs. Recently, the discriminative methods
represented by the perceptron and the Support Vector Machines, originally
designed for binary linear classifiers, have been extended to learn some
subclasses of the max-sum classifier. Besides the max-sum classifiers with the
acyclic neighbouring structure, it has been shown that the discriminative
learning is possible even with arbitrary neighbouring structure provided the
quality functions fulfill some additional constraints. In this article, we extend the
discriminative approach to three other classes of max-sum classifiers with
an arbitrary neighbourhood structure. We derive learning algorithms for two
subclasses of max-sum classifiers whose response can be computed in polynomial
time: (i) the max-sum classifiers with supermodular quality functions and
(ii) the max-sum classifiers whose response can be computed exactly by a
linear programming relaxation. Moreover, we show that the learning problem
can be approximately solved even for a general max-sum classifier.
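On an acyclic (chain) neighbourhood structure the max-sum response is computable exactly by dynamic programming; a minimal sketch of that special case (the random quality functions are illustrative):

```python
import numpy as np

def maxsum_chain(unary, pairwise):
    """Exact max-sum prediction on a chain by dynamic programming
    (the Viterbi recursion).
    unary[i, a]      : quality of label a at position i
    pairwise[i, a, b]: quality of labels (a, b) at positions (i, i+1)."""
    n, L = unary.shape
    score = unary[0].copy()
    back = np.zeros((n, L), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + pairwise[i - 1] + unary[i][None, :]
        back[i] = np.argmax(cand, axis=0)   # best predecessor for each label
        score = np.max(cand, axis=0)
    labels = [int(np.argmax(score))]
    for i in range(n - 1, 0, -1):           # backtrack the optimal tuple
        labels.append(int(back[i, labels[-1]]))
    return labels[::-1], float(score.max())

rng = np.random.default_rng(5)
unary = rng.normal(size=(5, 3))
pairwise = rng.normal(size=(4, 3, 3))
labels, value = maxsum_chain(unary, pairwise)
```

The recursion runs in O(n L^2) time, in contrast with the NP-complete general case the abstract mentions.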
On the Suitable Domain for SVM Training in Image Coding
http://jmlr.org/papers/v9/camps-valls08a.html
2008
<p>
Conventional SVM-based image coding methods are founded on
independently restricting the distortion in every image
coefficient at some particular image representation.
Geometrically, this implies allowing arbitrary signal distortions
in an <i>n</i>-dimensional rectangle defined by the
ε-insensitivity zone in each dimension of the selected
image representation domain. Unfortunately, not every image
representation domain is well-suited for such a simple,
scalar-wise, approach because statistical and/or perceptual
interactions between the coefficients may exist. These
interactions imply that scalar approaches may induce distortions
that do not follow the image statistics and/or are perceptually
annoying. Taking into account these relations would imply using
non-rectangular ε-insensitivity regions (allowing
coupled distortions in different coefficients), which is beyond
the conventional SVM formulation.
</p>
<p>
In this paper, we report a condition on the suitable domain for
developing efficient SVM image coding schemes.
We analytically demonstrate that no linear domain fulfills this
condition because of the statistical and perceptual
inter-coefficient relations that exist in these domains.
This theoretical result is experimentally
confirmed by comparing SVM learning in previously reported linear
domains and in a recently proposed non-linear perceptual domain
that simultaneously reduces the statistical and perceptual
relations (so it is closer to fulfilling the proposed condition).
These results highlight the relevance of an appropriate choice of
the image representation before SVM learning.
</p>
Linear-Time Computation of Similarity Measures for Sequential Data
http://jmlr.org/papers/v9/rieck08a.html
2008
<p>
Efficient and expressive comparison of sequences is an essential
procedure for learning with sequential data. In this article we
propose a generic framework for computation of similarity measures for
sequences, covering various kernel, distance and non-metric similarity
functions. The basis for comparison is the embedding of sequences using a
formal language, such as a set of natural words, <i>k</i>-grams or all
contiguous subsequences. As realizations of the framework we provide
linear-time algorithms of different complexity and capabilities using
sorted arrays, tries and suffix trees as underlying data structures.
</p>
<p>
Experiments on data sets from bioinformatics, text processing and
computer security illustrate the efficiency of the proposed
algorithms---enabling peak performances of up to 10<sup>6</sup>
pairwise comparisons per second. The utility of distances and
non-metric similarity measures for sequences as alternatives to string
kernels is demonstrated in applications of text categorization,
network intrusion detection and transcription site recognition in DNA.
</p>
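The embedding-based scheme is easy to sketch for the k-gram language using hashed dictionaries (the paper's realizations use sorted arrays, tries and suffix trees instead; `Counter` here is an illustrative substitute):

```python
from collections import Counter

def kgram_embedding(s, k):
    """Embed a sequence as its bag of contiguous k-grams."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s, t, k):
    """Inner product of k-gram embeddings, summed over only the k-grams
    the two sequences share: O(len(s) + len(t)) with O(1) hashing."""
    es, et = kgram_embedding(s, k), kgram_embedding(t, k)
    return sum(es[g] * et[g] for g in es.keys() & et.keys())

def kgram_manhattan(s, t, k):
    """Manhattan distance between k-gram embeddings, summed over the
    union of the two sequences' k-grams."""
    es, et = kgram_embedding(s, k), kgram_embedding(t, k)
    return sum(abs(es[g] - et[g]) for g in es.keys() | et.keys())

sim = spectrum_kernel("abcab", "abc", 2)    # shared 2-grams: 'ab' (2x1), 'bc' (1x1)
dist = kgram_manhattan("abcab", "abc", 2)
```

Swapping the per-k-gram term turns the same traversal into a kernel, a distance, or a non-metric similarity, which is the genericity the framework exploits.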
Max-margin Classification of Data with Absent Features
http://jmlr.org/papers/v9/chechik08a.html
2008
We consider the problem of learning classifiers in structured domains,
where some objects have a subset of features that are inherently
absent due to complex relationships between the features. Unlike the
case where a feature exists but its value is not observed, here we
focus on the case where a feature may not even exist (structurally
absent) for some of the samples. The common approach for handling
missing features in discriminative models is to first complete their
unknown values, and then use a standard classification procedure over
the completed data. This paper focuses on features that are known to
be non-existing, rather than have an unknown value. We show how
incomplete data can be classified <i>directly</i> without any
completion of the missing features using a max-margin learning
framework. We formulate an objective function, based on the geometric
interpretation of the margin, that aims to maximize the margin of each
sample in its own relevant subspace. In this formulation, the linearly
separable case can be transformed into a binary search over a series
of second order cone programs (SOCP), a convex problem that can be
solved efficiently. We also describe two approaches for optimizing the
general case: an approximation that can be solved as a standard
quadratic program (QP) and an iterative approach for solving the exact
problem. By avoiding the pre-processing phase in which the data is
completed, both of these approaches could offer considerable
computational savings. More importantly, we show that the elegant
handling of missing values by our approach allows it to both
outperform other methods when the missing values have non-trivial
structure, and be competitive with other methods when the values are
missing at random. We demonstrate our results on several standard
benchmarks and two real-world problems: edge prediction in metabolic
pathways, and automobile detection in natural images.
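The geometric idea behind the objective, measuring each sample's margin only in its own relevant subspace, can be sketched directly (the variable names and the comparison against zero-imputation are illustrative, not the paper's solver):

```python
import numpy as np

def subspace_margin(w, b, x, present, y):
    """Geometric margin of one sample measured in its own relevant
    subspace: only coordinates whose features structurally exist
    (present == True) enter the inner product and the scaling norm."""
    wf, xf = w[present], x[present]
    return y * (wf @ xf + b) / np.linalg.norm(wf)

w, b = np.array([3.0, 4.0]), 0.0
x = np.array([1.0, 0.0])                 # second feature structurally absent
present = np.array([True, False])
m_subspace = subspace_margin(w, b, x, present, y=1.0)   # 3*1 / ||(3,)|| = 1.0
m_zero_fill = 1.0 * (w @ x + b) / np.linalg.norm(w)     # 3 / 5 = 0.6
```

Zero-filling the absent feature deflates the margin by dividing by the full-space norm of w; restricting both the inner product and the norm to the existing coordinates removes that bias.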