http://www.jmlr.org
JMLRJournal of Machine Learning Research
Gradient Methods Never Overfit On Separable Data
http://jmlr.org/papers/v22/20-997.html
2021Ohad Shamir
A line of recent works established that when training linear predictors over separable data, using gradient methods and exponentially-tailed losses, the predictors asymptotically converge in direction to the max-margin predictor. As a consequence, the predictors asymptotically do not overfit. However, this does not address the question of whether overfitting might occur non-asymptotically, after some bounded number of iterations. In this paper, we formally show that standard gradient methods (in particular, gradient flow, gradient descent and stochastic gradient descent) *never* overfit on separable data: If we run these methods for $T$ iterations on a dataset of size $m$, both the empirical risk and the generalization error decrease at an essentially optimal rate of $\tilde{\mathcal{O}}(1/\gamma^2 T)$ up till $T\approx m$, at which point the generalization error remains fixed at an essentially optimal level of $\tilde{\mathcal{O}}(1/\gamma^2 m)$ regardless of how large $T$ is. Along the way, we present non-asymptotic bounds on the number of margin violations over the dataset, and prove their tightness.
Variance Reduced Median-of-Means Estimator for Byzantine-Robust Distributed Inference
http://jmlr.org/papers/v22/20-950.html
2021Jiyuan Tu, Weidong Liu, Xiaojun Mao, Xi Chen
This paper develops an efficient distributed inference algorithm, which is robust against a moderate fraction of Byzantine nodes, namely arbitrary and possibly adversarial machines in a distributed learning system. In robust statistics, the median-of-means (MOM) has been a popular approach to hedge against Byzantine failures due to its ease of implementation and computational efficiency. However, the MOM estimator has the shortcoming in terms of statistical efficiency. The first main contribution of the paper is to propose a variance reduced median-of-means (VRMOM) estimator, which improves the statistical efficiency over the vanilla MOM estimator and is computationally as efficient as the MOM. Based on the proposed VRMOM estimator, we develop a general distributed inference algorithm that is robust against Byzantine failures. Theoretically, our distributed algorithm achieves a fast convergence rate with only a constant number of rounds of communications. We also provide the asymptotic normality result for the purpose of statistical inference. To the best of our knowledge, this is the first normality result in the setting of Byzantine-robust distributed learning. The simulation results are also presented to illustrate the effectiveness of our method.
Statistical Query Lower Bounds for Tensor PCA
http://jmlr.org/papers/v22/20-837.html
2021Rishabh Dudeja, Daniel Hsu
In the Tensor PCA problem introduced by Richard and Montanari (2014), one is given a dataset consisting of $n$ samples $\mathbf{T}_{1:n}$ of i.i.d. Gaussian tensors of order $k$ with the promise that $\mathbb{E}\mathbf{T}_1$ is a rank-1 tensor and $\|\mathbb{E} \mathbf{T}_1\| = 1$. The goal is to estimate $\mathbb{E} \mathbf{T}_1$. This problem exhibits a large conjectured hard phase when $k>2$: When $d \lesssim n \ll d^{\frac{k}{2}}$ it is information theoretically possible to estimate $\mathbb{E} \mathbf{T}_1$, but no polynomial time estimator is known. We provide a sharp analysis of the optimal sample complexity in the Statistical Query (SQ) model and show that SQ algorithms with polynomial query complexity not only fail to solve Tensor PCA in the conjectured hard phase, but also have a strictly sub-optimal sample complexity compared to some polynomial time estimators such as the Richard-Montanari spectral estimator. Our analysis reveals that the optimal sample complexity in the SQ model depends on whether $\mathbb{E} \mathbf{T}_1$ is symmetric or not. For symmetric, even order tensors, we also isolate a sample size regime in which it is possible to test if $\mathbb{E} \mathbf{T}_1 = \mathbf{0}$ or $\mathbb{E}\mathbf{T}_1 \neq \mathbf{0}$ with polynomially many queries but not estimate $\mathbb{E}\mathbf{T}_1$. Our proofs rely on the Fourier analytic approach of Feldman, Perkins and Vempala (2018) to prove sharp SQ lower bounds.
PyKEEN 1.0: A Python Library for Training and Evaluating Knowledge Graph Embeddings
http://jmlr.org/papers/v22/20-825.html
2021Mehdi Ali, Max Berrendorf, Charles Tapley Hoyt, Laurent Vermue, Sahand Sharifzadeh, Volker Tresp, Jens Lehmann
Recently, knowledge graph embeddings (KGEs) have received significant attention, and several software libraries have been developed for training and evaluation. While each of them addresses specific needs, we report on a community effort to a re-design and re-implementation of PyKEEN, one of the early KGE libraries. PyKEEN 1.0 enables users to compose knowledge graph embedding models based on a wide range of interaction models, training approaches, loss functions, and permits the explicit modeling of inverse relations. It allows users to measure each component's influence individually on the model's performance. Besides, an automatic memory optimization has been realized in order to optimally exploit the provided hardware. Through the integration of Optuna, extensive hyper-parameter optimization (HPO) functionalities are provided.
Knowing what You Know: valid and validated confidence sets in multiclass and multilabel prediction
http://jmlr.org/papers/v22/20-753.html
2021Maxime Cauchois, Suyash Gupta, John C. Duchi
We develop conformal prediction methods for constructing valid predictive confidence sets in multiclass and multilabel problems without assumptions on the data generating distribution. A challenge here is that typical conformal prediction methods---which give marginal validity (coverage) guarantees---provide uneven coverage, in that they address easy examples at the expense of essentially ignoring difficult examples. By leveraging ideas from quantile regression, we build methods that always guarantee correct coverage but additionally provide (asymptotically consistent) conditional coverage for both multiclass and multilabel prediction problems. To address the potential challenge of exponentially large confidence sets in multilabel prediction, we build tree-structured classifiers that efficiently account for interactions between labels. Our methods can be bolted on top of any classification model---neural network, random forest, boosted tree---to guarantee its validity. We also provide an empirical evaluation, simultaneously providing new validation methods, that suggests the more robust coverage of our confidence sets.
Communication-Efficient Distributed Covariance Sketch, with Application to Distributed PCA
http://jmlr.org/papers/v22/20-705.html
2021Zengfeng Huang, Xuemin Lin, Wenjie Zhang, Ying Zhang
A sketch of a large data set captures vital properties of the original data while typically occupying much less space. In this paper, we consider the problem of computing a sketch of a massive data matrix $A\in\mathbb{R}^{n\times d}$ that is distributed across $s$ machines. Our goal is to output a matrix $B\in\mathbb{R}^{\ell\times d}$ which is significantly smaller than but still approximates $A$ well in terms of {covariance error}, i.e., $\|{A^TA-B^TB}\|_2$. Such a matrix $B$ is called a covariance sketch of $A$. We are mainly focused on minimizing the communication cost, which is arguably the most valuable resource in distributed computations. We show that there is a nontrivial gap between deterministic and randomized communication complexity for computing a covariance sketch. More specifically, we first prove an almost tight deterministic communication lower bound, then provide a new randomized algorithm with communication cost smaller than the deterministic lower bound. Based on a well-known connection between covariance sketch and approximate principle component analysis, we obtain better communication bounds for the distributed PCA problem. Moreover, we also give an improved distributed PCA algorithm for sparse input matrices, which uses our distributed sketching algorithm as a key building block.
Is SGD a Bayesian sampler? Well, almost
http://jmlr.org/papers/v22/20-676.html
2021Chris Mingard, Guillermo Valle-Pérez, Joar Skalse, Ard A. Louis
Deep neural networks (DNNs) generalise remarkably well in the overparameterised regime, suggesting a strong inductive bias towards functions with low generalisation error. We empirically investigate this bias by calculating, for a range of architectures and datasets, the probability $P_{SGD}(f\mid S)$ that an overparameterised DNN, trained with stochastic gradient descent (SGD) or one of its variants, converges on a function $f$ consistent with a training set $S$. We also use Gaussian processes to estimate the Bayesian posterior probability $P_{B}(f\mid S)$ that the DNN expresses $f$ upon random sampling of its parameters, conditioned on $S$. Our main findings are that $P_{SGD}(f\mid S)$ correlates remarkably well with $P_{B}(f\mid S)$ and that $P_{B}(f\mid S)$ is strongly biased towards low-error and low complexity functions. These results imply that strong inductive bias in the parameter-function map (which determines $P_{B}(f\mid S)$), rather than a special property of SGD, is the primary explanation for why DNNs generalise so well in the overparameterised regime. While our results suggest that the Bayesian posterior $P_{B}(f\mid S)$ is the first order determinant of $P_{SGD}(f\mid S)$, there remain second order differences that are sensitive to hyperparameter tuning. A function probability picture, based on $P_{SGD}(f\mid S)$ and/or $P_{B}(f\mid S)$, can shed light on the way that variations in architecture or hyperparameter settings such as batch size, learning rate, and optimiser choice, affect DNN performance.
POT: Python Optimal Transport
http://jmlr.org/papers/v22/20-451.html
2021Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, Titouan Vayer
Optimal transport has recently been reintroduced to the machine learning community thanks in part to novel efficient optimization procedures allowing for medium to large scale applications. We propose a Python toolbox that implements several key optimal transport ideas for the machine learning community. The toolbox contains implementations of a number of founding works of OT for machine learning such as Sinkhorn algorithm and Wasserstein barycenters, but also provides generic solvers that can be used for conducting novel fundamental research. This toolbox, named POT for Python Optimal Transport, is open source with an MIT license.
ChainerRL: A Deep Reinforcement Learning Library
http://jmlr.org/papers/v22/20-376.html
2021Yasuhiro Fujita, Prabhat Nagarajan, Toshiki Kataoka, Takahiro Ishikawa
In this paper, we introduce ChainerRL, an open-source deep reinforcement learning (DRL) library built using Python and the Chainer deep learning framework. ChainerRL implements a comprehensive set of DRL algorithms and techniques drawn from state-of-the-art research in the field. To foster reproducible research, and for instructional purposes, ChainerRL provides scripts that closely replicate the original papers' experimental settings and reproduce published benchmark results for several algorithms. Lastly, ChainerRL offers a visualization tool that enables the qualitative inspection of trained agents. The ChainerRL source code can be found on GitHub: https://github.com/chainer/chainerrl.
Analyzing the discrepancy principle for kernelized spectral filter learning algorithms
http://jmlr.org/papers/v22/20-358.html
2021Alain Celisse, Martin Wahl
We investigate the construction of early stopping rules in the nonparametric regression problem where iterative learning algorithms are used and the optimal iteration number is unknown. More precisely, we study the discrepancy principle, as well as modifications based on smoothed residuals, for kernelized spectral filter learning algorithms including Tikhonov regularization and gradient descent. Our main theoretical bounds are oracle inequalities established for the empirical estimation error (fixed design), and for the prediction error (random design). From these finite-sample bounds it follows that the classical discrepancy principle is statistically adaptive for slow rates occurring in the hard learning scenario, while the smoothed discrepancy principles are adaptive over ranges of faster rates (resp. higher smoothness parameters). Our approach relies on deviation inequalities for the stopping rules in the fixed design setting, combined with change-of-norm arguments to deal with the random design setting.
Attention is Turing-Complete
http://jmlr.org/papers/v22/20-302.html
2021Jorge Pérez, Pablo Barceló, Javier Marinkovic
Alternatives to recurrent neural networks, in particular, architectures based on self-attention, are gaining momentum for processing input sequences. In spite of their relevance, the computational properties of such networks have not yet been fully explored.We study the computational power of the Transformer, one of the most paradigmatic architectures exemplifying self-attention. We show that the Transformer with hard-attention is Turing complete exclusively based on their capacity to compute and access internal dense representations of the data.Our study also reveals some minimal sets of elements needed to obtain this completeness result.
Kernel Operations on the GPU, with Autodiff, without Memory Overflows
http://jmlr.org/papers/v22/20-275.html
2021Benjamin Charlier, Jean Feydy, Joan Alexis Glaunès, François-David Collin, Ghislain Durif
The KeOps library provides a fast and memory-efficient GPU support for tensors whose entries are given by a mathematical formula, such as kernel and distance matrices. KeOps alleviates the main bottleneck of tensor-centric libraries for kernel and geometric applications: memory consumption. It also supports automatic differentiation and outperforms standard GPU baselines, including PyTorch CUDA tensors or the Halide and TVM libraries. KeOps combines optimized C++/CUDA schemes with binders for high-level languages: Python (Numpy and PyTorch), Matlab and GNU R. As a result, high-level “quadratic” codes can now scale up to large data sets with millions of samples processed in seconds. KeOps brings graphics-like performances for kernel methods and is freely available on standard repositories (PyPi, CRAN). To showcase its versatility, we provide tutorials in a wide range of settings online at www.kernel-operations.io.
Optimization with Momentum: Dynamical, Control-Theoretic, and Symplectic Perspectives
http://jmlr.org/papers/v22/20-207.html
2021Michael Muehlebach, Michael I. Jordan
We analyze the convergence rate of various momentum-based optimization algorithms from a dynamical systems point of view. Our analysis exploits fundamental topological properties, such as the continuous dependence of iterates on their initial conditions, to provide a simple characterization of convergence rates. In many cases, closed-form expressions are obtained that relate algorithm parameters to the convergence rate. The analysis encompasses discrete time and continuous time, as well as time-invariant and time-variant formulations, and is not limited to a convex or Euclidean setting. In addition, the article rigorously establishes why symplectic discretization schemes are important for momentum-based optimization algorithms, and provides a characterization of algorithms that exhibit accelerated convergence.
Prediction against a limited adversary
http://jmlr.org/papers/v22/20-1234.html
2021Erhan Bayraktar, Ibrahim Ekren, Xin Zhang
We study the problem of prediction with expert advice with adversarial corruption where the adversary can at most corrupt one expert. Using tools from viscosity theory, we characterize the long-time behavior of the value function of the game between the forecaster and the adversary. We provide lower and upper bounds for the growth rate of regret without relying on a comparison result. We show that depending on the description of regret, the limiting behavior of the game can significantly differ.
Phase Diagram for Two-layer ReLU Neural Networks at Infinite-width Limit
http://jmlr.org/papers/v22/20-1123.html
2021Tao Luo, Zhi-Qin John Xu, Zheng Ma, Yaoyu Zhang
How neural network behaves during the training over different choices of hyperparameters is an important question in the study of neural networks. In this work, inspired by the phase diagram in statistical mechanics, we draw the phase diagram for the two-layer ReLU neural network at the infinite-width limit for a complete characterization of its dynamical regimes and their dependence on hyperparameters related to initialization. Through both experimental and theoretical approaches, we identify three regimes in the phase diagram, i.e., linear regime, critical regime and condensed regime, based on the relative change of input weights as the width approaches infinity, which tends to $0$, $O(1)$ and $+\infty$, respectively. In the linear regime, NN training dynamics is approximately linear similar to a random feature model with an exponential loss decay. In the condensed regime, we demonstrate through experiments that active neurons are condensed at several discrete orientations. The critical regime serves as the boundary between above two regimes, which exhibits an intermediate nonlinear behavior with the mean-field model as a typical example. Overall, our phase diagram for the two-layer ReLU NN serves as a map for the future studies and is a first step towards a more systematical investigation of the training behavior and the implicit regularization of NNs of different structures.
Testing Conditional Independence via Quantile Regression Based Partial Copulas
http://jmlr.org/papers/v22/20-1074.html
2021Lasse Petersen, Niels Richard Hansen
The partial copula provides a method for describing the dependence between two random variables $X$ and $Y$ conditional on a third random vector $Z$ in terms of nonparametric residuals $U_1$ and $U_2$. This paper develops a nonparametric test for conditional independence by combining the partial copula with a quantile regression based method for estimating the nonparametric residuals. We consider a test statistic based on generalized correlation between $U_1$ and $U_2$ and derive its large sample properties under consistency assumptions on the quantile regression procedure. We demonstrate through a simulation study that the resulting test is sound under complicated data generating distributions. Moreover, in the examples considered the test is competitive to other state-of-the-art conditional independence tests in terms of level and power, and it has superior power in cases with conditional variance heterogeneity of $X$ and $Y$ given $Z$.
Determining the Number of Communities in Degree-corrected Stochastic Block Models
http://jmlr.org/papers/v22/20-037.html
2021Shujie Ma, Liangjun Su, Yichong Zhang
We propose to estimate the number of communities in degree-corrected stochastic block models based on a pseudo likelihood ratio statistic. To this end, we introduce a method that combines spectral clustering with binary segmentation. This approach guarantees an upper bound for the pseudo likelihood ratio statistic when the model is over-fitted. We also derive its limiting distribution when the model is under-fitted. Based on these properties, we establish the consistency of our estimator for the true number of communities. Developing these theoretical properties require a mild condition on the average degrees - growing at a rate no slower than log(n), where n is the number of nodes. Our proposed method is further illustrated by simulation studies and analysis of real-world networks. The numerical results show that our approach has satisfactory performance when the network is semi-dense.
Path Length Bounds for Gradient Descent and Flow
http://jmlr.org/papers/v22/19-979.html
2021Chirag Gupta, Sivaraman Balakrishnan, Aaditya Ramdas
We derive bounds on the path length $\zeta$ of gradient descent (GD) and gradient flow (GF) curves for various classes of smooth convex and nonconvex functions. Among other results, we prove that: (a) if the iterates are linearly convergent with factor $(1-c)$, then $\zeta$ is at most $\mathcal{O}(1/c)$; (b) under the Polyak-Kurdyka-\L ojasiewicz (PKL) condition, $\zeta$ is at most $\mathcal{O}(\sqrt{\kappa})$, where $\kappa$ is the condition number, and at least $\widetilde\Omega(\sqrt{d} \wedge \kappa^{1/4})$; (c) for quadratics, $\zeta$ is $\Theta(\min\{\sqrt{d},\sqrt{\log \kappa}\})$ and in some cases can be independent of $\kappa$; (d) assuming just convexity, $\zeta$ can be at most $2^{4d\log d}$; (e) for separable quasiconvex functions, $\zeta$ is ${\Theta}(\sqrt{d})$. Thus, we advance current understanding of the properties of GD and GF curves beyond rates of convergence. We expect our techniques to facilitate future studies for other algorithms.
A General Framework for Empirical Bayes Estimation in Discrete Linear Exponential Family
http://jmlr.org/papers/v22/19-873.html
2021Trambak Banerjee, Qiang Liu, Gourab Mukherjee, Wengunag Sun
We develop a Nonparametric Empirical Bayes (NEB) framework for compound estimation in the discrete linear exponential family, which includes a wide class of discrete distributions frequently arising from modern big data applications. We propose to directly estimate the Bayes shrinkage factor in the generalized Robbins' formula via solving a convex program, which is carefully developed based on a RKHS representation of the Stein's discrepancy measure. The new NEB estimation framework is flexible for incorporating various structural constraints into the data driven rule, and provides a unified approach to compound estimation with both regular and scaled squared error losses. We develop theory to show that the class of NEB estimators enjoys strong asymptotic properties. Comprehensive simulation studies as well as analyses of real data examples are carried out to demonstrate the superiority of the NEB estimator over competing methods.
Approximate Newton Methods
http://jmlr.org/papers/v22/19-870.html
2021Haishan Ye, Luo Luo, Zhihua Zhang
Many machine learning models involve solving optimization problems. Thus, it is important to address a large-scale optimization problem in big data applications. Recently, subsampled Newton methods have emerged to attract much attention due to their efficiency at each iteration, rectified a weakness in the ordinary Newton method of suffering a high cost in each iteration while commanding a high convergence rate. Other efficient stochastic second order methods have been also proposed. However, the convergence properties of these methods are still not well understood. There are also several important gaps between the current convergence theory and the empirical performance in real applications. In this paper, we aim to fill these gaps. We propose a unifying framework to analyze both local and global convergence properties of second order methods. Accordingly, we present our theoretical results which match the empirical performance in real applications well.
Dynamic Tensor Recommender Systems
http://jmlr.org/papers/v22/19-792.html
2021Yanqing Zhang, Xuan Bi, Niansheng Tang, Annie Qu
Recommender systems have been extensively used by the entertainment industry, business marketing and the biomedical industry. In addition to its capacity of providing preference based recommendations as an unsupervised learning methodology, it has been also proven useful in sales forecasting, product introduction and other production related businesses. Since some consumers and companies need a recommendation or prediction for future budget, labor and supply chain coordination, dynamic recommender systems for precise forecasting have become extremely necessary. In this article, we propose a new recommendation method, namely the dynamic tensor recommender system (DTRS), which aims particularly at forecasting future recommendation. The proposed method utilizes a tensor-valued function of time to integrate time and contextual information, and creates a time-varying coefficient model for temporal tensor factorization through a polynomial spline approximation. Major advantages of the proposed method include competitive future recommendation predictions and effective prediction interval estimations. In theory, we establish the convergence rate of the proposed tensor factorization and asymptotic normality of the spline coefficient estimator. The proposed method is applied to simulations, IRI marketing data and Last.fm data. Numerical studies demonstrate that the proposed method outperforms existing methods in terms of future time forecasting.
Sparse Tensor Additive Regression
http://jmlr.org/papers/v22/19-769.html
2021Botao Hao, Boxiang Wang, Pengyuan Wang, Jingfei Zhang, Jian Yang, Will Wei Sun
Tensors are becoming prevalent in modern applications such as medical imaging and digital marketing. In this paper, we propose a sparse tensor additive regression (STAR) that models a scalar response as a flexible nonparametric function of tensor covariates. The proposed model effectively exploits the sparse and low-rank structures in the tensor additive regression. We formulate the parameter estimation as a non-convex optimization problem, and propose an efficient penalized alternating minimization algorithm. We establish a non-asymptotic error bound for the estimator obtained from each iteration of the proposed algorithm, which reveals an interplay between the optimization error and the statistical rate of convergence. We demonstrate the efficacy of STAR through extensive comparative simulation studies, and an application to the click-through-rate prediction in online advertising.
Geometric structure of graph Laplacian embeddings
http://jmlr.org/papers/v22/19-683.html
2021Nicolás García Trillos, Franca Hoffmann, Bamdad Hosseini
We analyze the spectral clustering procedure for identifying coarse structure in a data set $\mathbf{x}_1, \dots, \mathbf{x}_n$, and in particular study the geometry of graph Laplacian embeddings which form the basis for spectral clustering algorithms. More precisely, we assume that the data are sampled from a mixture model supported on a manifold $\mathcal{M}$ embedded in $\mathbb{R}^d$, and pick a connectivity length-scale $\varepsilon>0$ to construct a kernelized graph Laplacian. We introduce a notion of a well-separated mixture model which only depends on the model itself, and prove that when the model is well separated, with high probability the embedded data set concentrates on cones that are centered around orthogonal vectors. Our results are meaningful in the regime where $\varepsilon = \varepsilon(n)$ is allowed to decay to zero at a slow enough rate as the number of data points grows. This rate depends on the intrinsic dimension of the manifold on which the data is supported.
How to Gain on Power: Novel Conditional Independence Tests Based on Short Expansion of Conditional Mutual Information
http://jmlr.org/papers/v22/19-600.html
2021Mariusz Kubkowski, Jan Mielniczuk, Paweł Teisseyre
Conditional independence tests play a crucial role in many machine learning procedures such as feature selection, causal discovery, and structure learning of dependence networks. They are used in most of the existing algorithms for Markov Blanket discovery such as Grow-Shrink or Incremental Association Markov Blanket. One of the most frequently used tests for categorical variables is based on the conditional mutual information ($CMI$) and its asymptotic distribution. However, it is known that the power of such test dramatically decreases when the size of the conditioning set grows, i.e. the test fails to detect true significant variables, when the set of already selected variables is large. To overcome this drawback for discrete data, we propose to replace the conditional mutual information by Short Expansion of Conditional Mutual Information (called $SECMI$), obtained by truncating the Möbius representation of $CMI$. We prove that the distribution of $SECMI$ converges to either a normal distribution or to a distribution of some quadratic form in normal random variables. This property is crucial for the construction of a novel test of conditional independence which uses one of these distributions, chosen in a data dependent way, as a reference under the null hypothesis. The proposed methods have significantly larger power for discrete data than the standard asymptotic tests of conditional independence based on $CMI$ while retaining control of the probability of type I error.
Stochastic Proximal AUC Maximization
http://jmlr.org/papers/v22/19-418.html
2021Yunwen Lei, Yiming Ying
In this paper we consider the problem of maximizing the Area under the ROC curve (AUC) which is a widely used performance metric in imbalanced classification and anomaly detection. Due to the pairwise nonlinearity of the objective function, classical SGD algorithms do not apply to the task of AUC maximization. We propose a novel stochastic proximal algorithm for AUC maximization which is scalable to large scale streaming data. Our algorithm can accommodate general penalty terms and is easy to implement with favorable $O(d)$ space and per-iteration time complexities. We establish a high-probability convergence rate $O(1/\sqrt{T})$ for the general convex setting, and improve it to a fast convergence rate $O(1/T)$ for the cases of strongly convex regularizers and no regularization term (without strong convexity). Our proof does not need the uniform boundedness assumption on the loss function or the iterates which is more fidelity to the practice. Finally, we perform extensive experiments over various benchmark data sets from real-world application domains which show the superior performance of our algorithm over the existing AUC maximization algorithms.
A Distributed Method for Fitting Laplacian Regularized Stratified Models
http://jmlr.org/papers/v22/19-345.html
2021Jonathan Tuck, Shane Barratt, Stephen Boyd
Stratified models are models that depend in an arbitrary way on a set of selected categorical features, and depend linearly on the other features. In a basic and traditional formulation a separate model is fit for each value of the categorical feature, using only the data that has the specific categorical value. To this formulation we add Laplacian regularization, which encourages the model parameters for neighboring categorical values to be similar. Laplacian regularization allows us to specify one or more weighted graphs on the stratification feature values. For example, stratifying over the days of the week, we can specify that the Sunday model parameter should be close to the Saturday and Monday model parameters. The regularization improves the performance of the model over the traditional stratified model, since the model for each value of the categorical `borrows strength' from its neighbors. In particular, it produces a model even for categorical values that did not appear in the training data set. We propose an efficient distributed method for fitting stratified models, based on the alternating direction method of multipliers (ADMM). When the fitting loss functions are convex, the stratified model fitting problem is convex, and our method computes the global minimizer of the loss plus regularization; in other cases it computes a local minimizer. The method is very efficient, and naturally scales to large data sets or numbers of stratified feature values. We illustrate our method with a variety of examples.
Predictive Learning on Hidden Tree-Structured Ising Models
http://jmlr.org/papers/v22/19-149.html
2021Konstantinos E. Nikolakakis, Dionysios S. Kalogerias, Anand D. Sarwate
We provide high-probability sample complexity guarantees for exact structure recovery and accurate predictive learning using noise-corrupted samples from an acyclic (tree-shaped) graphical model. The hidden variables follow a tree-structured Ising model distribution, whereas the observable variables are generated by a binary symmetric channel taking the hidden variables as its input (flipping each bit independently with some constant probability $q\in [0,1/2)$). In the absence of noise, predictive learning on Ising models was recently studied by Bresler and Karzand (2020); this paper quantifies how noise in the hidden model impacts the tasks of structure recovery and marginal distribution estimation by proving upper and lower bounds on the sample complexity. Our results generalize state-of-the-art bounds reported in prior work, and they exactly recover the noiseless case ($q=0$). In fact, for any tree with $p$ vertices and probability of incorrect recovery $\delta>0$, the sufficient number of samples remains logarithmic as in the noiseless case, i.e., $\mathcal{O}(\log(p/\delta))$, while the dependence on $q$ is $\mathcal{O}\big( 1/(1-2q)^{4} \big)$, for both aforementioned tasks. We also present a new equivalent of Isserlis' Theorem for sign-valued tree-structured distributions, yielding a new low-complexity algorithm for higher-order moment estimation.
Estimation and Inference for High Dimensional Generalized Linear Models: A Splitting and Smoothing Approach
http://jmlr.org/papers/v22/19-132.html
2021Zhe Fei, Yi Li
The focus of modern biomedical studies has gradually shifted to explanation and estimation of joint effects of high dimensional predictors on disease risks. Quantifying uncertainty in these estimates may provide valuable insight into prevention strategies or treatment decisions for both patients and physicians. High dimensional inference, including confidence intervals and hypothesis testing, has sparked much interest. While much work has been done in the linear regression setting, there is lack of literature on inference for high dimensional generalized linear models. We propose a novel and computationally feasible method, which accommodates a variety of outcome types, including normal, binomial, and Poisson data. We use a “splitting and smoothing” approach, which splits samples into two parts, performs variable selection using one part and conducts partial regression with the other part. Averaging the estimates over multiple random splits, we obtain the smoothed estimates, which are numerically stable. We show that the estimates are consistent, asymptotically normal, and construct confidence intervals with proper coverage probabilities for all predictors. We examine the finite sample performance of our method by comparing it with the existing methods and applying it to analyze a lung cancer cohort study.
Normalizing Flows for Probabilistic Modeling and Inference
http://jmlr.org/papers/v22/19-1028.html
2021George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, Balaji Lakshminarayanan
Normalizing flows provide a general mechanism for defining expressive probability distributions, only requiring the specification of a (usually simple) base distribution and a series of bijective transformations. There has been much recent work on normalizing flows, ranging from improving their expressive power to expanding their application. We believe the field has now matured and is in need of a unified perspective. In this review, we attempt to provide such a perspective by describing flows through the lens of probabilistic modeling and inference. We place special emphasis on the fundamental principles of flow design, and discuss foundational topics such as expressive power and computational trade-offs. We also broaden the conceptual framing of flows by relating them to more general probability transformations. Lastly, we summarize the use of flows for tasks such as generative modeling, approximate inference, and supervised learning.
Incorporating Unlabeled Data into Distributionally Robust Learning
http://jmlr.org/papers/v22/19-1023.html
2021Charlie Frogner, Sebastian Claici, Edward Chien, Justin Solomon
We study a robust alternative to empirical risk minimization called distributionally robust learning (DRL), in which one learns to perform against an adversary who can choose the data distribution from a specified set of distributions. We illustrate a problem with current DRL formulations, which rely on an overly broad definition of allowed distributions for the adversary, leading to learned classifiers that are unable to predict with any confidence. We propose a solution that incorporates unlabeled data into the DRL problem to further constrain the adversary. We show that this new formulation is tractable for stochastic gradient-based optimization and yields a computable guarantee on the future performance of the learned classifier, analogous to -- but tighter than -- guarantees from conventional DRL. We examine the performance of this new formulation on 14 real data sets and find that it often yields effective classifiers with nontrivial performance guarantees in situations where conventional DRL produces neither. Inspired by these results, we extend our DRL formulation to active learning with a novel, distributionally-robust version of the standard model-change heuristic. Our active learning algorithm often achieves superior learning performance to the original heuristic on real data sets.
Integrative Generalized Convex Clustering Optimization and Feature Selection for Mixed Multi-View Data
http://jmlr.org/papers/v22/19-1012.html
2021Minjie Wang, Genevera I. Allen
In mixed multi-view data, multiple sets of diverse features are measured on the same set of samples. By integrating all available data sources, we seek to discover common group structure among the samples that may be hidden in individualistic cluster analyses of a single data view. While several techniques for such integrative clustering have been explored, we propose and develop a convex formalization that enjoys strong empirical performance and inherits the mathematical properties of increasingly popular convex clustering methods. Specifically, our Integrative Generalized Convex Clustering Optimization (iGecco) method employs different convex distances, losses, or divergences for each of the different data views with a joint convex fusion penalty that leads to common groups. Additionally, integrating mixed multi-view data is often challenging when each data source is high-dimensional. To perform feature selection in such scenarios, we develop an adaptive shifted group-lasso penalty that selects features by shrinking them towards their loss-specific centers. Our so-called iGecco+ approach selects features from each data view that are best for determining the groups, often leading to improved integrative clustering. To solve our problem, we develop a new type of generalized multi-block ADMM algorithm using sub-problem approximations that more efficiently fits our model for big data sets. Through a series of numerical experiments and real data examples on text mining and genomics, we show that iGecco+ achieves superior empirical performance for high-dimensional mixed multi-view data.
GemBag: Group Estimation of Multiple Bayesian Graphical Models
http://jmlr.org/papers/v22/19-1006.html
2021Xinming Yang, Lingrui Gan, Naveen N. Narisetty, Feng Liang
In this paper, we propose a novel hierarchical Bayesian model and an efficient estimation method for the problem of joint estimation of multiple graphical models, which have similar but different sparsity structures and signal strength. Our proposed hierarchical Bayesian model is well suited for sharing of sparsity structures, and our procedure, called as GemBag, is shown to enjoy optimal theoretical properties in terms of sup-norm estimation accuracy and correct recovery of the graphical structure even when some of the signals are weak. Although optimization of the posterior distribution required for obtaining our proposed estimator is a non-convex optimization problem, we show that it turns out to be convex in a large constrained space facilitating the use of computationally efficient algorithms. Through extensive simulation studies and an application to a bike sharing data set, we demonstrate that the proposed GemBag procedure has strong empirical performance in comparison with alternative methods.
Subspace Clustering through Sub-Clusters
http://jmlr.org/papers/v22/18-780.html
2021Weiwei Li, Jan Hannig, Sayan Mukherjee
The problem of dimension reduction is of increasing importance in modern data analysis. In this paper, we consider modeling the collection of points in a high dimensional space as a union of low dimensional subspaces. In particular we propose a highly scalable sampling based algorithm that clusters the entire data via first spectral clustering of a small random sample followed by classifying or labeling the remaining out-of-sample points. The key idea is that this random subset borrows information across the entire dataset and that the problem of clustering points can be replaced with the more efficient problem of "clustering sub-clusters". We provide theoretical guarantees for our procedure. The numerical results indicate that for large datasets the proposed algorithm outperforms other state-of-the-art subspace clustering algorithms with respect to accuracy and speed.
Sparse and Smooth Signal Estimation: Convexification of L0-Formulations
http://jmlr.org/papers/v22/18-745.html
2021Alper Atamturk, Andres Gomez, Shaoning Han
Signal estimation problems with smoothness and sparsity priors can be naturally modeled as quadratic optimization with $\ell_0$-“norm” constraints. Since such problems are non-convex and hard-to-solve, the standard approach is, instead, to tackle their convex surrogates based on $\ell_1$-norm relaxations. In this paper, we propose new iterative (convex) conic quadratic relaxations that exploit not only the $\ell_0$-“norm” terms, but also the fitness and smoothness functions. The iterative convexification approach substantially closes the gap between the $\ell_0$-“norm” and its $\ell_1$ surrogate. These stronger relaxations lead to significantly better estimators than $\ell_1$-norm approaches and also allow one to utilize affine sparsity priors. In addition, the parameters of the model and the resulting estimators are easily interpretable. Experiments with a tailored Lagrangian decomposition method indicate that the proposed iterative convex relaxations yield solutions within 1\% of the exact $\ell_0$-approach, and can tackle instances with up to 100,000 variables under one minute.
Projection-free Decentralized Online Learning for Submodular Maximization over Time-Varying Networks
http://jmlr.org/papers/v22/18-407.html
2021Junlong Zhu, Qingtao Wu, Mingchuan Zhang, Ruijuan Zheng, Keqin Li
This paper considers a decentralized online submodular maximization problem over time-varying networks, where each agent only utilizes its own information and the received information from its neighbors. To address the problem, we propose a decentralized Meta-Frank-Wolfe online learning method in the adversarial online setting by using local communication and local computation. Moreover, we show that an expected regret bound of $O(\sqrt{T})$ is achieved with $(1-1/e)$ approximation guarantee, where $T$ is a time horizon. In addition, we also propose a decentralized one-shot Frank-Wolfe online learning method in the stochastic online setting. Furthermore, we also show that an expected regret bound $O(T^{2/3})$ is obtained with $(1-1/e)$ approximation guarantee. Finally, we confirm the theoretical results via various experiments on different datasets.
Structure Learning of Undirected Graphical Models for Count Data
http://jmlr.org/papers/v22/18-401.html
2021Nguyen Thi Kim Hue, Monica Chiogna
Mainly motivated by the problem of modelling biological processes underlying the basic functions of a cell -that typically involve complex interactions between genes- we present a new algorithm, called PC-LPGM, for learning the structure of undirected graphical models over discrete variables. We prove theoretical consistency of PC-LPGM in the limit of infinite observations and discuss its robustness to model misspecification. To evaluate the performance of PC-LPGM in recovering the true structure of the graphs in situations where relatively moderate sample sizes are available, extensive simulation studies are conducted, that also allow to compare our proposal with its main competitors. A biological validation of the algorithm is presented through the analysis of two real data sets.
From Low Probability to High Confidence in Stochastic Convex Optimization
http://jmlr.org/papers/v22/20-821.html
2021Damek Davis, Dmitriy Drusvyatskiy, Lin Xiao, Junyu Zhang
Standard results in stochastic convex optimization bound the number of samples that an algorithm needs to generate a point with small function value in expectation. More nuanced high probability guarantees are rare, and typically either rely on light-tail noise assumptions or exhibit worse sample complexity. In this work, we show that a wide class of stochastic optimization algorithms for strongly convex problems can be augmented with high confidence bounds at an overhead cost that is only logarithmic in the confidence level and polylogarithmic in the condition number. The procedure we propose, called proxBoost, is elementary and builds on two well-known ingredients: robust distance estimation and the proximal point method. We discuss consequences for both streaming (online) algorithms and offline algorithms based on empirical risk minimization.
Optimal Feedback Law Recovery by Gradient-Augmented Sparse Polynomial Regression
http://jmlr.org/papers/v22/20-755.html
2021Behzad Azmi, Dante Kalise, Karl Kunisch
A sparse regression approach for the computation of high-dimensional optimal feedback laws arising in deterministic nonlinear control is proposed. The approach exploits the control-theoretical link between Hamilton-Jacobi-Bellman PDEs characterizing the value function of the optimal control problems, and first-order optimality conditions via Pontryagin's Maximum Principle. The latter is used as a representation formula to recover the value function and its gradient at arbitrary points in the space-time domain through the solution of a two-point boundary value problem. After generating a dataset consisting of different state-value pairs, a hyperbolic cross polynomial model for the value function is fitted using a LASSO regression. An extended set of low and high-dimensional numerical tests in nonlinear optimal control reveal that enriching the dataset with gradient information reduces the number of training samples, and that the sparse polynomial regression consistently yields a feedback law of lower complexity.
Understanding Recurrent Neural Networks Using Nonequilibrium Response Theory
http://jmlr.org/papers/v22/20-620.html
2021Soon Hoe Lim
Recurrent neural networks (RNNs) are brain-inspired models widely used in machine learning for analyzing sequential data. The present work is a contribution towards a deeper understanding of how RNNs process input signals using the response theory from nonequilibrium statistical mechanics. For a class of continuous-time stochastic RNNs (SRNNs) driven by an input signal, we derive a Volterra type series representation for their output. This representation is interpretable and disentangles the input signal from the SRNN architecture. The kernels of the series are certain recursively defined correlation functions with respect to the unperturbed dynamics that completely determine the output. Exploiting connections of this representation and its implications to rough paths theory, we identify a universal feature -- the response feature, which turns out to be the signature of tensor product of the input signal and a natural support basis. In particular, we show that SRNNs, with only the weights in the readout layer optimized and the weights in the hidden layer kept fixed and not optimized, can be viewed as kernel machines operating on a reproducing kernel Hilbert space associated with the response feature
Optimal Structured Principal Subspace Estimation: Metric Entropy and Minimax Rates
http://jmlr.org/papers/v22/20-610.html
2021Tony Cai, Hongzhe Li, Rong Ma
Driven by a wide range of applications, several principal subspace estimation problems have been studied individually under different structural constraints. This paper presents a unified framework for the statistical analysis of a general structured principal subspace estimation problem which includes as special cases sparse PCA/SVD, non-negative PCA/SVD, subspace constrained PCA/SVD, and spectral clustering. General minimax lower and upper bounds are established to characterize the interplay between the information-geometric complexity of the constraint set for the principal subspaces, the signal-to-noise ratio (SNR), and the dimensionality. The results yield interesting phase transition phenomena concerning the rates of convergence as a function of the SNRs and the fundamental limit for consistent estimation. Applying the general results to the specific settings yields the minimax rates of convergence for those problems, including the previous unknown optimal rates for sparse SVD, non-negative PCA/SVD and subspace constrained PCA/SVD.
RaSE: Random Subspace Ensemble Classification
http://jmlr.org/papers/v22/20-600.html
2021Ye Tian, Yang Feng
We propose a flexible ensemble classification framework, Random Subspace Ensemble (RaSE), for sparse classification. In the RaSE algorithm, we aggregate many weak learners, where each weak learner is a base classifier trained in a subspace optimally selected from a collection of random subspaces. To conduct subspace selection, we propose a new criterion, ratio information criterion (RIC), based on weighted Kullback-Leibler divergence. The theoretical analysis includes the risk and Monte-Carlo variance of the RaSE classifier, establishing the screening consistency and weak consistency of RIC, and providing an upper bound for the misclassification rate of the RaSE classifier. In addition, we show that in a high-dimensional framework, the number of random subspaces needs to be very large to guarantee that a subspace covering signals is selected. Therefore, we propose an iterative version of the RaSE algorithm and prove that under some specific conditions, a smaller number of generated random subspaces are needed to find a desirable subspace through iteration. An array of simulations under various models and real-data applications demonstrate the effectiveness and robustness of the RaSE classifier and its iterative version in terms of low misclassification rate and accurate feature ranking. The RaSE algorithm is implemented in the R package RaSEn on CRAN.
Wasserstein barycenters can be computed in polynomial time in fixed dimension
http://jmlr.org/papers/v22/20-588.html
2021Jason M Altschuler, Enric Boix-Adsera
Computing Wasserstein barycenters is a fundamental geometric problem with widespread applications in machine learning, statistics, and computer graphics. However, it is unknown whether Wasserstein barycenters can be computed in polynomial time, either exactly or to high precision (i.e., with $\textrm{polylog}(1/\varepsilon)$ runtime dependence). This paper answers these questions in the affirmative for any fixed dimension. Our approach is to solve an exponential-size linear programming formulation by efficiently implementing the corresponding separation oracle using techniques from computational geometry.
Banach Space Representer Theorems for Neural Networks and Ridge Splines
http://jmlr.org/papers/v22/20-583.html
2021Rahul Parhi, Robert D. Nowak
We develop a variational framework to understand the properties of the functions learned by neural networks fit to data. We propose and study a family of continuous-domain linear inverse problems with total variation-like regularization in the Radon domain subject to data fitting constraints. We derive a representer theorem showing that finite-width, single-hidden layer neural networks are solutions to these inverse problems. We draw on many techniques from variational spline theory and so we propose the notion of polynomial ridge splines, which correspond to single-hidden layer neural networks with truncated power functions as the activation function. The representer theorem is reminiscent of the classical reproducing kernel Hilbert space representer theorem, but we show that the neural network problem is posed over a non-Hilbertian Banach space. While the learning problems are posed in the continuous-domain, similar to kernel methods, the problems can be recast as finite-dimensional neural network training problems. These neural network training problems have regularizers which are related to the well-known weight decay and path-norm regularizers. Thus, our result gives insight into functional characteristics of trained neural networks and also into the design neural network regularizers. We also show that these regularizers promote neural network solutions with desirable generalization properties.
High-Order Langevin Diffusion Yields an Accelerated MCMC Algorithm
http://jmlr.org/papers/v22/20-576.html
2021Wenlong Mou, Yi-An Ma, Martin J. Wainwright, Peter L. Bartlett, Michael I. Jordan
We propose a Markov chain Monte Carlo (MCMC) algorithm based on third-order Langevin dynamics for sampling from distributions with smooth, log-concave densities. The higher-order dynamics allow for more flexible discretization schemes, and we develop a specific method that combines splitting with more accurate integration. For a broad class of $d$-dimensional distributions arising from generalized linear models, we prove that the resulting third-order algorithm produces samples from a distribution that is at most $\varepsilon > 0$ in Wasserstein distance from the target distribution in $O\left(\frac{d^{1/4}}{ \varepsilon^{1/2}} \right)$ steps. This result requires only Lipschitz conditions on the gradient. For general strongly convex potentials with $\alpha$-th order smoothness, we prove that the mixing time scales as $O \left( \frac{d^{1/4}}{\varepsilon^{1/2}} + \frac{d^{1/2}}{ \varepsilon^{1/(\alpha - 1)}} \right)$.
From Fourier to Koopman: Spectral Methods for Long-term Time Series Prediction
http://jmlr.org/papers/v22/20-406.html
2021Henning Lange, Steven L. Brunton, J. Nathan Kutz
We propose spectral methods for long-term forecasting of temporal signals stemming from linear and nonlinear quasi-periodic dynamical systems. For linear signals, we introduce an algorithm with similarities to the Fourier transform but which does not rely on periodicity assumptions, allowing for forecasting given potentially arbitrary sampling intervals. We then extend this algorithm to handle nonlinearities by leveraging Koopman theory. The resulting algorithm performs a spectral decomposition in a nonlinear, data-dependent basis. The optimization objective for both algorithms is highly non-convex. However, expressing the objective in the frequency domain allows us to compute global optima of the error surface in a scalable and efficient manner, partially by exploiting the computational properties of the Fast Fourier Transform. Because of their close relation to Bayesian Spectral Analysis, uncertainty quantification metrics are a natural byproduct of the spectral forecasting methods. We extensively benchmark these algorithms against other leading forecasting methods on a range of synthetic experiments as well as in the context of real-world power systems and fluid flows.
Residual Energy-Based Models for Text
http://jmlr.org/papers/v22/20-326.html
2021Anton Bakhtin, Yuntian Deng, Sam Gross, Myle Ott, Marc'Aurelio Ranzato, Arthur Szlam
Current large-scale auto-regressive language models display impressive fluency and can generate convincing text. In this work we start by asking the question: Can the generations of these models be reliably distinguished from real text by statistical discriminators? We find experimentally that the answer is affirmative when we have access to the training data for the model, and guardedly affirmative even if we do not. This suggests that the auto-regressive models can be improved by incorporating the (globally normalized) discriminators into the generative process. We give a formalism for this using the Energy-Based Model framework, and show that it indeed improves the results of the generative models, measured both in terms of perplexity and in terms of human evaluation.
giotto-tda: : A Topological Data Analysis Toolkit for Machine Learning and Data Exploration
http://jmlr.org/papers/v22/20-325.html
2021Guillaume Tauzin, Umberto Lupo, Lewis Tunstall, Julian Burella Pérez, Matteo Caorsi, Anibal M. Medina-Mardones, Alberto Dassatti, Kathryn Hess
We introduce giotto-tda, a Python library that integrates high-performance topological data analysis with machine learning via a scikit-learn-compatible API and state-of-the-art C++ implementations. The library's ability to handle various types of data is rooted in a wide range of preprocessing techniques, and its strong focus on data exploration and interpretability is aided by an intuitive plotting API. Source code, binaries, examples, and documentation can be found at https://github.com/giotto-ai/giotto-tda.
Risk-Averse Learning by Temporal Difference Methods with Markov Risk Measures
http://jmlr.org/papers/v22/20-168.html
2021Umit Köse, Andrzej Ruszczyński
We propose a novel reinforcement learning methodology where the system performance is evaluated by a Markov coherent dynamic risk measure with the use of linear value function approximations. We construct projected risk-averse dynamic programming equations and study their properties. We propose new risk-averse counterparts of the basic and multi-step methods of temporal differences and we prove their convergence with probability one. We also perform an empirical study on a complex transportation problem.
A Bayesian Contiguous Partitioning Method for Learning Clustered Latent Variables
http://jmlr.org/papers/v22/20-136.html
2021Zhao Tang Luo, Huiyan Sang, Bani Mallick
This article develops a Bayesian partitioning prior model from spanning trees of a graph, by first assigning priors on spanning trees, and then the number and the positions of removed edges given a spanning tree. The proposed method guarantees contiguity in clustering and allows to detect clusters with arbitrary shapes and sizes, whereas most existing partition models such as binary trees and Voronoi tessellations do not possess such properties. We embed this partition model within a hierarchical modeling framework to detect a clustered pattern in latent variables. We focus on illustrating the method through a clustered regression coefficient model for spatial data and propose extensions to other hierarchical models. We prove Bayesian posterior concentration results under an asymptotic framework with random graphs. We design an efficient collapsed Reversible Jump Markov chain Monte Carlo (RJ-MCMC) algorithm to estimate the clustered coefficient values and their uncertainty measures. Finally, we illustrate the performance of the model with simulation studies and a real data analysis of detecting the temperature-salinity relationship from water masses in the Atlantic Ocean.
Multi-class Gaussian Process Classification with Noisy Inputs
http://jmlr.org/papers/v22/20-107.html
2021Carlos Villacampa-Calvo, Bryan Zaldívar, Eduardo C. Garrido-Merchán, Daniel Hernández-Lobato
It is a common practice in the machine learning community to assume that the observed data are noise-free in the input attributes. Nevertheless, scenarios with input noise are common in real problems, as measurements are never perfectly accurate. If this input noise is not taken into account, a supervised machine learning method is expected to perform sub-optimally. In this paper, we focus on multi-class classification problems and use Gaussian processes (GPs) as the underlying classifier. Motivated by a data set coming from the astrophysics domain, we hypothesize that the observed data may contain noise in the inputs. Therefore, we devise several multi-class GP classifiers that can account for input noise. Such classifiers can be efficiently trained using variational inference to approximate the posterior distribution of the latent variables of the model. Moreover, in some situations, the amount of noise can be known before-hand. If this is the case, it can be readily introduced in the proposed methods. This prior information is expected to lead to better performance results. We have evaluated the proposed methods by carrying out several experiments, involving synthetic and real data. These include several data sets from the UCI repository, the MNIST data set and a data set coming from astrophysics. The results obtained show that, although the classification error is similar across methods, the predictive distribution of the proposed methods is better, in terms of the test log-likelihood, than the predictive distribution of a classifier based on GPs that ignores input noise.
Learning and Planning for Time-Varying MDPs Using Maximum Likelihood Estimation
http://jmlr.org/papers/v22/20-006.html
2021Melkior Ornik, Ufuk Topcu
This paper proposes a formal approach to online learning and planning for agents operating in a priori unknown, time-varying environments. The proposed method computes the maximally likely model of the environment, given the observations about the environment made by an agent earlier in the system run and assuming knowledge of a bound on the maximal rate of change of system dynamics. Such an approach generalizes the estimation method commonly used in learning algorithms for unknown Markov decision processes with time-invariant transition probabilities, but is also able to quickly and correctly identify the system dynamics following a change. Based on the proposed method, we generalize the exploration bonuses used in learning for time-invariant Markov decision processes by introducing a notion of uncertainty in a learned time-varying model, and develop a control policy for time-varying Markov decision processes based on the exploitation and exploration trade-off. We demonstrate the proposed methods on four numerical examples: a patrolling task with a change in system dynamics, a two-state MDP with periodically changing outcomes of actions, a wind flow estimation task, and a multi-armed bandit problem with periodically changing probabilities of different rewards.
Neighborhood Structure Assisted Non-negative Matrix Factorization and Its Application in Unsupervised Point-wise Anomaly Detection
http://jmlr.org/papers/v22/19-924.html
2021Imtiaz Ahmed, Xia Ben Hu, Mithun P. Acharya, Yu Ding
Dimensionality reduction is considered as an important step for ensuring competitive performance in unsupervised learning such as anomaly detection. Non-negative matrix factorization (NMF) is a widely used method to accomplish this goal. But NMF do not have the provision to include the neighborhood structure information and, as a result, may fail to provide satisfactory performance in presence of nonlinear manifold structure. To address this shortcoming, we propose to consider the neighborhood structural similarity information within the NMF framework and do so by modeling the data through a minimum spanning tree. We label the resulting method as the neighborhood structure-assisted NMF. We further develop both offline and online algorithms for implementing the proposed method. Empirical comparisons using twenty benchmark data sets as well as an industrial data set extracted from a hydropower plant demonstrate the superiority of the neighborhood structure-assisted NMF. Looking closer into the formulation and properties of the proposed NMF method and comparing it with several NMF variants reveal that inclusion of the MST-based neighborhood structure plays a key role in attaining the enhanced performance in anomaly detection.
Asynchronous Online Testing of Multiple Hypotheses
http://jmlr.org/papers/v22/19-910.html
2021Tijana Zrnic, Aaditya Ramdas, Michael I. Jordan
We consider the problem of asynchronous online testing, aimed at providing control of the false discovery rate (FDR) during a continual stream of data collection and testing, where each test may be a sequential test that can start and stop at arbitrary times. This setting increasingly characterizes real-world applications in science and industry, where teams of researchers across large organizations may conduct tests of hypotheses in a decentralized manner. The overlap in time and space also tends to induce dependencies among test statistics, a challenge for classical methodology, which either assumes (overly optimistically) independence or (overly pessimistically) arbitrary dependence between test statistics. We present a general framework that addresses both of these issues via a unified computational abstraction that we refer to as “conflict sets.” We show how this framework yields algorithms with formal FDR guarantees under a more intermediate, local notion of dependence. We illustrate our algorithms in simulations by comparing to existing algorithms for online FDR control.
Learning interaction kernels in heterogeneous systems of agents from multiple trajectories
http://jmlr.org/papers/v22/19-861.html
2021Fei Lu, Mauro Maggioni, Sui Tang
Systems of interacting particles, or agents, have wide applications in many disciplines, including Physics, Chemistry, Biology and Economics. These systems are governed by interaction laws, which are often unknown: estimating them from observation data is a fundamental task that can provide meaningful insights and accurate predictions of the behaviour of the agents. In this paper, we consider the inverse problem of learning interaction laws given data from multiple trajectories, in a nonparametric fashion, when the interaction kernels depend on pairwise distances. We establish a condition for learnability of interaction kernels, and construct an estimator based on the minimization of a suitably regularized least squares functional, that is guaranteed to converge, in a suitable $L^2$ space, at the optimal min-max rate for 1-dimensional nonparametric regression. We propose an efficient learning algorithm to construct such estimator, which can be implemented in parallel for multiple trajectories and is therefore well-suited for the high dimensional, big data regime. Numerical simulations on a variety examples, including opinion dynamics, predator-prey and swarm dynamics and heterogeneous particle dynamics, suggest that the learnability condition is satisfied in models used in practice, and the rate of convergence of our estimator is consistent with the theory. These simulations also suggest that our estimators are robust to noise in the observations, and can produce accurate predictions of trajectories in large time intervals, even when they are learned from observations in short time intervals.
FLAME: A Fast Large-scale Almost Matching Exactly Approach to Causal Inference
http://jmlr.org/papers/v22/19-853.html
2021Tianyu Wang, Marco Morucci, M. Usaid Awan, Yameng Liu, Sudeepa Roy, Cynthia Rudin, Alexander Volfovsky
A classical problem in causal inference is that of matching, where treatment units need to be matched to control units based on covariate information. In this work, we propose a method that computes high quality almost-exact matches for high-dimensional categorical datasets. This method, called FLAME (Fast Large-scale Almost Matching Exactly), learns a distance metric for matching using a hold-out training data set. In order to perform matching efficiently for large datasets, FLAME leverages techniques that are natural for query processing in the area of database management, and two implementations of FLAME are provided: the first uses SQL queries and the second uses bit-vector techniques. The algorithm starts by constructing matches of the highest quality (exact matches on all covariates), and successively eliminates variables in order to match exactly on as many variables as possible, while still maintaining interpretable high-quality matches and balance between treatment and control groups. We leverage these high quality matches to estimate conditional average treatment effects (CATEs). Our experiments show that FLAME scales to huge datasets with millions of observations where existing state-of-the-art methods fail, and that it achieves significantly better performance than other matching methods.
A Review of Robot Learning for Manipulation: Challenges, Representations, and Algorithms
http://jmlr.org/papers/v22/19-804.html
2021Oliver Kroemer, Scott Niekum, George Konidaris
A key challenge in intelligent robotics is creating robots that are capable of directly interacting with the world around them to achieve their goals. The last decade has seen substantial growth in research on the problem of robot manipulation, which aims to exploit the increasing availability of affordable robot arms and grippers to create robots capable of directly interacting with the world to achieve their goals. Learning will be central to such autonomous systems, as the real world contains too much variation for a robot to expect to have an accurate model of its environment, the objects in it, or the skills required to manipulate them, in advance. We aim to survey a representative subset of that research which uses machine learning for manipulation. We describe a formalization of the robot manipulation learning problem that synthesizes existing research into a single coherent framework and highlight the many remaining research opportunities and challenges.
Single and Multiple Change-Point Detection with Differential Privacy
http://jmlr.org/papers/v22/19-770.html
2021Wanrong Zhang, Sara Krehbiel, Rui Tuo, Yajun Mei, Rachel Cummings
The change-point detection problem seeks to identify distributional changes at an unknown change-point $k^*$ in a stream of data. This problem appears in many important practical settings involving personal data, including biosurveillance, fault detection, finance, signal detection, and security systems. The field of differential privacy offers data analysis tools that provide powerful worst-case privacy guarantees. We study the statistical problem of change-point detection through the lens of differential privacy. We give private algorithms for both online and offline change-point detection, analyze these algorithms theoretically, and provide empirical validation of our results.
Tsallis-INF: An Optimal Algorithm for Stochastic and Adversarial Bandits
http://jmlr.org/papers/v22/19-753.html
2021Julian Zimmert, Yevgeny Seldin
We derive an algorithm that achieves the optimal (within constants) pseudo-regret in both adversarial and stochastic multi-armed bandits without prior knowledge of the regime and time horizon. The algorithm is based on online mirror descent (OMD) with Tsallis entropy regularization with power $\alpha=1/2$ and reduced-variance loss estimators. More generally, we define an adversarial regime with a self-bounding constraint, which includes stochastic regime, stochastically constrained adversarial regime, and stochastic regime with adversarial corruptions as special cases, and show that the algorithm achieves logarithmic regret guarantee in this regime and all of its special cases simultaneously with the optimal regret guarantee in the adversarial regime. The algorithm also achieves adversarial and stochastic optimality in the utility-based dueling bandit setting. We provide empirical evaluation of the algorithm demonstrating that it significantly outperforms UCB1 and EXP3 in stochastic environments. We also provide examples of adversarial environments, where UCB1 and Thompson Sampling exhibit almost linear regret, whereas our algorithm suffers only logarithmic regret. To the best of our knowledge, this is the first example demonstrating vulnerability of Thompson Sampling in adversarial environments. Last but not least, we present a general stochastic analysis and a general adversarial analysis of OMD algorithms with Tsallis entropy regularization for $\alpha\in[0,1]$ and explain the reason why $\alpha=1/2$ works best.
Inference In High-dimensional Single-Index Models Under Symmetric Designs
http://jmlr.org/papers/v22/19-744.html
2021Hamid Eftekhari, Moulinath Banerjee, Ya'acov Ritov
The problem of statistical inference for regression coefficients in a high-dimensional single-index model is considered. Under elliptical symmetry, the single index model can be reformulated as a proxy linear model whose regression parameter is identifiable. We construct estimates of the regression coefficients of interest that are similar to the debiased lasso estimates in the standard linear model and exhibit similar properties: $\sqrt{n}$-consistency and asymptotic normality. The procedure completely bypasses the estimation of the unknown link function, which can be extremely challenging depending on the underlying structure of the problem. Furthermore, under Gaussianity, we propose more efficient estimates of the coefficients by expanding the link function in the Hermite polynomial basis. Finally, we illustrate our approach via carefully designed simulation experiments.
Finite Time LTI System Identification
http://jmlr.org/papers/v22/19-725.html
2021Tuhin Sarkar, Alexander Rakhlin, Munther A. Dahleh
We address the problem of learning the parameters of a stable linear time invariant (LTI) system with unknown latent space dimension, or order, from a single time--series of noisy input-output data. We focus on learning the best lower order approximation allowed by finite data. Motivated by subspace algorithms in systems theory, where the doubly infinite system Hankel matrix captures both order and good lower order approximations, we construct a Hankel-like matrix from noisy finite data using ordinary least squares. This circumvents the non-convexities that arise in system identification, and allows accurate estimation of the underlying LTI system. Our results rely on careful analysis of self-normalized martingale difference terms that helps bound identification error up to logarithmic factors of the lower bound. We provide a data-dependent scheme for order selection and find an accurate realization of system parameters, corresponding to that order, by an approach that is closely related to the Ho-Kalman subspace algorithm. We demonstrate that the proposed model order selection procedure is not overly conservative, i.e., for the given data length it is not possible to estimate higher order models or find higher order approximations with reasonable accuracy.
Generalization Performance of Multi-pass Stochastic Gradient Descent with Convex Loss Functions
http://jmlr.org/papers/v22/19-716.html
2021Yunwen Lei, Ting Hu, Ke Tang
Stochastic gradient descent (SGD) has become the method of choice to tackle large-scale datasets due to its low computational cost and good practical performance. Learning rate analysis, either capacity-independent or capacity-dependent, provides a unifying viewpoint to study the computational and statistical properties of SGD, as well as the implicit regularization by tuning the number of passes. Existing capacity-independent learning rates require a nontrivial bounded subgradient assumption and a smoothness assumption to be optimal. Furthermore, existing capacity-dependent learning rates are only established for the specific least squares loss with a special structure. In this paper, we provide both optimal capacity-independent and capacity-dependent learning rates for SGD with general convex loss functions. Our results require neither bounded subgradient assumptions nor smoothness assumptions, and are stated with high probability. We achieve this improvement by a refined estimate on the norm of SGD iterates based on a careful martingale analysis and concentration inequalities on empirical processes.
Entangled Kernels - Beyond Separability
http://jmlr.org/papers/v22/19-665.html
2021Riikka Huusari, Hachem Kadri
We consider the problem of operator-valued kernel learning and investigate the possibility of going beyond the well-known separable kernels. Borrowing tools and concepts from the field of quantum computing, such as partial trace and entanglement, we propose a new view on operator-valued kernels and define a general family of kernels that encompasses previously known operator-valued kernels, including separable and transformable kernels. Within this framework, we introduce another novel class of operator-valued kernels called entangled kernels that are not separable. We propose an efficient two-step algorithm for this framework, where the entangled kernel is learned based on a novel extension of kernel alignment to operator-valued kernels. We illustrate our algorithm with an application to supervised dimensionality reduction, and demonstrate its effectiveness with both artificial and real data for multi-output regression.
A Two-Level Decomposition Framework Exploiting First and Second Order Information for SVM Training Problems
http://jmlr.org/papers/v22/19-632.html
2021Giulio Galvan, Matteo Lapucci, Chih-Jen Lin, Marco Sciandrone
In this work we present a novel way to solve the sub-problems that originate when using decomposition algorithms to train Support Vector Machines (SVMs). State-of-the-art Sequential Minimization Optimization (SMO) solvers reduce the original problem to a sequence of sub-problems of two variables for which the solution is analytical. Although considering more than two variables at a time usually results in a lower number of iterations needed to train an SVM model, solving the sub-problem becomes much harder and the overall computational gains are limited, if any. We propose to apply the two-variables decomposition method to solve the sub-problems themselves and experimentally show that it is a viable and efficient way to deal with sub-problems of up to 50 variables. As a second contribution we explore different ways to select the working set and its size, combining first-order and second-order working set selection rules together with a strategy for exploiting cached elements of the Hessian matrix. An extensive numerical comparison shows that the method performs considerably better than state-of-the-art software.
When random initializations help: a study of variational inference for community detection
http://jmlr.org/papers/v22/19-630.html
2021Purnamrita Sarkar, Y. X. Rachel Wang, Soumendu S. Mukherjee
Variational approximation has been widely used in large-scale Bayesian inference recently, the simplest kind of which involves imposing a mean field assumption to approximate complicated latent structures. Despite the computational scalability of mean field, theoretical studies of its loss function surface and the convergence behavior of iterative updates for optimizing the loss are far from complete. In this paper, we focus on the problem of community detection for a simple two-class Stochastic Blockmodel (SBM) with equal class sizes. Using batch co-ordinate ascent (BCAVI) for updates, we show different convergence behavior with respect to different initializations. When the parameters are known or estimated within a reasonable range and held fixed, we characterize conditions under which an initialization can converge to the ground truth. On the other hand, when the parameters need to be estimated iteratively, a random initialization will converge to an uninformative local optimum.
A Fast Globally Linearly Convergent Algorithm for the Computation of Wasserstein Barycenters
http://jmlr.org/papers/v22/19-629.html
2021Lei Yang, Jia Li, Defeng Sun, Kim-Chuan Toh
We consider the problem of computing a Wasserstein barycenter for a set of discrete probability distributions with finite supports, which finds many applications in areas such as statistics, machine learning and image processing. When the support points of the barycenter are pre-specified, this problem can be modeled as a linear programming (LP) problem whose size can be extremely large. To handle this large-scale LP, we analyse the structure of its dual problem, which is conceivably more tractable and can be reformulated as a well-structured convex problem with 3 kinds of block variables and a coupling linear equality constraint. We then adapt a symmetric Gauss-Seidel based alternating direction method of multipliers (sGS-ADMM) to solve the resulting dual problem and establish its global convergence and global linear convergence rate. As a critical component for efficient computation, we also show how all the subproblems involved can be solved exactly and efficiently. This makes our method suitable for computing a Wasserstein barycenter on a large-scale data set, without introducing an entropy regularization term as is commonly practiced. In addition, our sGS-ADMM can be used as a subroutine in an alternating minimization method to compute a barycenter when its support points are not pre-specified. Numerical results on synthetic data sets and image data sets demonstrate that our method is highly competitive for solving large-scale Wasserstein barycenter problems, in comparison to two existing representative methods and the commercial software Gurobi.
Aggregated Hold-Out
http://jmlr.org/papers/v22/19-624.html
2021Guillaume Maillard, Sylvain Arlot, Matthieu Lerasle
Aggregated hold-out (agghoo) is a method which averages learning rules selected by hold-out (that is, cross-validation with a single split). We provide the first theoretical guarantees on agghoo, ensuring that it can be used safely: Agghoo performs at worst like the hold-out when the risk is convex. The same holds true in classification with the 0--1 risk, with an additional constant factor. For the hold-out, oracle inequalities are known for bounded losses, as in binary classification. We show that similar results can be proved, under appropriate assumptions, for other risk-minimization problems. In particular, we obtain an oracle inequality for regularized kernel regression with a Lipschitz loss, without requiring that the $Y$ variable or the regressors be bounded. Numerical experiments show that aggregation brings a significant improvement over the hold-out and that agghoo is competitive with cross-validation.
Ranking and synchronization from pairwise measurements via SVD
http://jmlr.org/papers/v22/19-542.html
2021Alexandre d'Aspremont, Mihai Cucuringu, Hemant Tyagi
Given a measurement graph $G= (V,E)$ and an unknown signal $r \in \mathbb{R}^n$, we investigate algorithms for recovering $r$ from pairwise measurements of the form $r_i - r_j$; $\{i,j\} \in E$. This problem arises in a variety of applications, such as ranking teams in sports data and time synchronization of distributed networks. Framed in the context of ranking, the task is to recover the ranking of $n$ teams (induced by $r$) given a small subset of noisy pairwise rank offsets. We propose a simple SVD-based algorithmic pipeline for both the problem of time synchronization and ranking. We provide a detailed theoretical analysis in terms of robustness against both sampling sparsity and noise perturbations with outliers, using results from matrix perturbation and random matrix theory. Our theoretical findings are complemented by a detailed set of numerical experiments on both synthetic and real data, showcasing the competitiveness of our proposed algorithms with other state-of-the-art methods.
A Unified Sample Selection Framework for Output Noise Filtering: An Error-Bound Perspective
http://jmlr.org/papers/v22/19-498.html
2021Gaoxia Jiang, Wenjian Wang, Yuhua Qian, Jiye Liang
The existence of output noise will bring difficulties to supervised learning. Noise filtering, aiming to detect and remove polluted samples, is one of the main ways to deal with the noise on outputs. However, most of the filters are heuristic and could not explain the filtering influence on the generalization error (GE) bound. The hyper-parameters in various filters are specified manually or empirically, and they are usually unable to adapt to the data environment. The filter with an improper hyper-parameter may overclean, leading to a weak generalization ability. This paper proposes a unified framework of optimal sample selection (OSS) for the output noise filtering from the perspective of error bound. The covering distance filter (CDF) under the framework is presented to deal with noisy outputs in regression and ordinal classification problems. Firstly, two necessary and sufficient conditions for a fixed goodness of fit in regression are deduced from the perspective of GE bound. They provide the unified theoretical framework for determining the filtering effectiveness and optimizing the size of removed samples. The optimal sample size has the adaptability to the environmental changes in the sample size, the noise ratio, and noise variance. It offers a choice of tuning the hyper-parameter and could prevent filters from overcleansing. Meanwhile, the OSS framework can be integrated with any noise estimator and produces a new filter. Then the covering interval is proposed to separate low-noise and high-noise samples, and the effectiveness is proved in regression. The covering distance is introduced as an unbiased estimator of high noises. Further, the CDF algorithm is designed by integrating the cover distance with the OSS framework. Finally, it is verified that the CDF not only recognizes noise labels correctly but also brings down the prediction errors on real apparent age data set. Experimental results on benchmark regression and ordinal classification data sets demonstrate that the CDF outperforms the state-of-the-art filters in terms of prediction ability, noise recognition, and efficiency.
Continuous Time Analysis of Momentum Methods
http://jmlr.org/papers/v22/19-466.html
2021Nikola B. Kovachki, Andrew M. Stuart
Gradient descent-based optimization methods underpin the parameter training of neural networks, and hence comprise a significant component in the impressive test results found in a number of applications. Introducing stochasticity is key to their success in practical problems, and there is some understanding of the role of stochastic gradient descent in this context. Momentum modifications of gradient descent such as Polyak's Heavy Ball method (HB) and Nesterov's method of accelerated gradients (NAG), are also widely adopted. In this work our focus is on understanding the role of momentum in the training of neural networks, concentrating on the common situation in which the momentum contribution is fixed at each step of the algorithm. To expose the ideas simply we work in the deterministic setting. Our approach is to derive continuous time approximations of the discrete algorithms; these continuous time approximations provide insights into the mechanisms at play within the discrete algorithms. We prove three such approximations. Firstly we show that standard implementations of fixed momentum methods approximate a time-rescaled gradient descent flow, asymptotically as the learning rate shrinks to zero; this result does not distinguish momentum methods from pure gradient descent, in the limit of vanishing learning rate. We then proceed to prove two results aimed at understanding the observed practical advantages of fixed momentum methods over gradient descent, when implemented in the non-asymptotic regime with fixed small, but non-zero, learning rate. We achieve this by proving approximations to continuous time limits in which the small but fixed learning rate appears as a parameter; this is known as the method of modified equations in the numerical analysis literature, recently rediscovered as the high resolution ODE approximation in the machine learning context. In our second result we show that the momentum method is approximated by a continuous time gradient flow, with an additional momentum-dependent second order time-derivative correction, proportional to the learning rate; this may be used to explain the stabilizing effect of momentum algorithms in their transient phase. Furthermore in a third result we show that the momentum methods admit an exponentially attractive invariant manifold on which the dynamics reduces, approximately, to a gradient flow with respect to a modified loss function, equal to the original loss function plus a small perturbation proportional to the learning rate; this small correction provides convexification of the loss function and encodes additional robustness present in momentum methods, beyond the transient phase.
Pykg2vec: A Python Library for Knowledge Graph Embedding
http://jmlr.org/papers/v22/19-433.html
2021Shih-Yuan Yu, Sujit Rokka Chhetri, Arquimedes Canedo, Palash Goyal, Mohammad Abdullah Al Faruque
Pykg2vec is a Python library for learning the representations of the entities and relations in knowledge graphs. Pykg2vec's flexible and modular software architecture currently implements 25 state-of-the-art knowledge graph embedding algorithms, and is designed to easily incorporate new algorithms.The goal of pykg2vec is to provide a practical and educational platform to accelerate research in knowledge graph representation learning. Pykg2vec is built on top of PyTorch and Python's multiprocessing framework and provides modules for batch generation, Bayesian hyperparameter optimization, evaluation of KGE tasks, embedding, and result visualization. Pykg2vec is released under the MIT License and is also available in the Python Package Index (PyPI). The source code of pykg2vec is available at https://github.com/Sujit-O/pykg2vec.
Simple and Fast Algorithms for Interactive Machine Learning with Random Counter-examples
http://jmlr.org/papers/v22/19-372.html
2021Jagdeep Singh Bhatia
This work describes simple and efficient algorithms for interactively learning non-binary concepts in the learning from random counter-examples (LRC) model. Here, learning takes place from random counter-examples that the learner receives in response to their proper equivalence queries, and the learning time is the number of counter-examples needed by the learner to identify the target concept. Such learning is particularly suited for online ranking, classification, clustering, etc., where machine learning models must be used before they are fully trained. We provide two simple LRC algorithms, deterministic and randomized, for exactly learning concepts from any concept class $H$. We show that both these algorithms have an $\mathcal{O}(\log{}|H|)$ asymptotically optimal average learning time. This solves an open problem on the existence of an efficient LRC randomized algorithm while also simplifying previous results and improving their computational efficiency. We also show that the expected learning time of any Arbitrary LRC algorithm can be upper bounded by $\mathcal{O}(\frac{1}{\epsilon}\log{\frac{|H|}{\delta}})$, where $\epsilon$ and $\delta$ are the allowed learning error and failure probability respectively. This shows that LRC interactive learning is at least as efficient as non-interactive Probably Approximately Correct (PAC) learning. Our simulations also show that these algorithms outperform their theoretical bounds.
On Multi-Armed Bandit Designs for Dose-Finding Trials
http://jmlr.org/papers/v22/19-228.html
2021Maryam Aziz, Emilie Kaufmann, Marie-Karelle Riviere
We study the problem of finding the optimal dosage in early stage clinical trials through the multi-armed bandit lens. We advocate the use of the Thompson Sampling principle, a flexible algorithm that can accommodate different types of monotonicity assumptions on the toxicity and efficacy of the doses. For the simplest version of Thompson Sampling, based on a uniform prior distribution for each dose, we provide finite-time upper bounds on the number of sub-optimal dose selections, which is unprecedented for dose-finding algorithms. Through a large simulation study, we then show that variants of Thompson Sampling based on more sophisticated prior distributions outperform state-of-the-art dose identification algorithms in different types of dose-finding studies that occur in phase I or phase I/II trials.
Homogeneity Structure Learning in Large-scale Panel Data with Heavy-tailed Errors
http://jmlr.org/papers/v22/19-1018.html
2021Xiao Di, Yuan Ke, Runze Li
Large-scale panel data is ubiquitous in many modern data science applications. Conventional panel data analysis methods fail to address the new challenges, like individual impacts of covariates, endogeneity, embedded low-dimensional structure, and heavy-tailed errors, arising from the innovation of data collection platforms on which applications operate. In response to these challenges, this paper studies large-scale panel data with an interactive effects model. This model takes into account the individual impacts of covariates on each spatial node and removes the exogenous condition by allowing latent factors to affect both covariates and errors. Besides, we waive the sub-Gaussian assumption and allow the errors to be heavy-tailed. Further, we propose a data-driven procedure to learn a parsimonious yet flexible homogeneity structure embedded in high-dimensional individual impacts of covariates. The homogeneity structure assumes that there exists a partition of regression coefficients where the coefficients are the same within each group but different between the groups. The homogeneity structure is flexible as it contains many widely assumed low-dimensional structures (sparsity, global impact, etc.) as its special cases. Non-asymptotic properties are established to justify the proposed learning procedure. Extensive numerical experiments demonstrate the advantage of the proposed learning procedure over conventional methods especially when the data are generated from heavy-tailed distributions.
Global and Quadratic Convergence of Newton Hard-Thresholding Pursuit
http://jmlr.org/papers/v22/19-026.html
2021Shenglong Zhou, Naihua Xiu, Hou-Duo Qi
Algorithms based on the hard thresholding principle have been well studied with sounding theoretical guarantees in the compressed sensing and more general sparsity-constrained optimization. It is widely observed in existing empirical studies that when a restricted Newton step was used (as the debiasing step), the hard-thresholding algorithms tend to meet halting conditions in a significantly low number of iterations and are very efficient. Hence, the thus obtained Newton hard-thresholding algorithms call for stronger theoretical guarantees than for their simple hard-thresholding counterparts. This paper provides a theoretical justification for the use of the restricted Newton step. We build our theory and algorithm, Newton Hard-Thresholding Pursuit (NHTP), for the sparsity-constrained optimization. Our main result shows that NHTP is quadratically convergent under the standard assumption of restricted strong convexity and smoothness. We also establish its global convergence to a stationary point under a weaker assumption. In the special case of the compressive sensing, NHTP effectively reduces to some of the existing hard-thresholding algorithms with a Newton step. Consequently, our fast convergence result justifies why those algorithms perform better than without the Newton step. The efficiency of NHTP was demonstrated on both synthetic and real data in compressed sensing and sparse logistic regression.
Unfolding-Model-Based Visualization: Theory, Method and Applications
http://jmlr.org/papers/v22/18-846.html
2021Yunxiao Chen, Zhiliang Ying, Haoran Zhang
Multidimensional unfolding methods are widely used for visualizing item response data. Such methods project respondents and items simultaneously onto a low-dimensional Euclidian space, in which respondents and items are represented by ideal points, with person-person, item-item, and person-item similarities being captured by the Euclidian distances between the points. In this paper, we study the visualization of multidimensional unfolding from a statistical perspective. We cast multidimensional unfolding into an estimation problem, where the respondent and item ideal points are treated as parameters to be estimated. An estimator is then proposed for the simultaneous estimation of these parameters. Asymptotic theory is provided for the recovery of the ideal points, shedding lights on the validity of model-based visualization. An alternating projected gradient descent algorithm is proposed for the parameter estimation. We provide two illustrative examples, one on users' movie rating and the other on senate roll call voting.
Mixing Time of Metropolis-Hastings for Bayesian Community Detection
http://jmlr.org/papers/v22/18-770.html
2021Bumeng Zhuo, Chao Gao
We study the computational complexity of a Metropolis-Hastings algorithm for Bayesian community detection. We first establish a posterior strong consistency result for a natural prior distribution on stochastic block models under the optimal signal-to-noise ratio condition in the literature. We then give a set of conditions that guarantee rapidly mixing of a simple Metropolis-Hastings algorithm. The mixing time analysis is based on a careful study of posterior ratios and a canonical path argument to control the spectral gap of the Markov chain.
Convex Clustering: Model, Theoretical Guarantee and Efficient Algorithm
http://jmlr.org/papers/v22/18-694.html
2021Defeng Sun, Kim-Chuan Toh, Yancheng Yuan
Clustering is a fundamental problem in unsupervised learning. Popular methods like K-means, may suffer from poor performance as they are prone to get stuck in its local minima. Recently, the sum-of-norms (SON) model (also known as the convex clustering model) has been proposed by Pelckmans et al. (2005), Lindsten et al. (2011) and Hocking et al. (2011). The perfect recovery properties of the convex clustering model with uniformly weighted all-pairwise-differences regularization have been proved by Zhu et al. (2014) and Panahi et al. (2017). However, no theoretical guarantee has been established for the general weighted convex clustering model, where better empirical results have been observed. In the numerical optimization aspect, although algorithms like the alternating direction method of multipliers (ADMM) and the alternating minimization algorithm (AMA) have been proposed to solve the convex clustering model (Chi and Lange, 2015), it still remains very challenging to solve large-scale problems. In this paper, we establish sufficient conditions for the perfect recovery guarantee of the general weighted convex clustering model, which include and improve existing theoretical results in (Zhu et al., 2014; Panahi et al., 2017) as special cases. In addition, we develop a semismooth Newton based augmented Lagrangian method for solving large-scale convex clustering problems. Extensive numerical experiments on both simulated and real data demonstrate that our algorithm is highly efficient and robust for solving large-scale problems. Moreover, the numerical results also show the superior performance and scalability of our algorithm comparing to the existing first-order methods. In particular, our algorithm is able to solve a convex clustering problem with 200,000 points in $\mathbb{R}^3$ in about 6 minutes.
A Unified Framework for Random Forest Prediction Error Estimation
http://jmlr.org/papers/v22/18-558.html
2021Benjamin Lu, Johanna Hardin
We introduce a unified framework for random forest prediction error estimation based on a novel estimator of the conditional prediction error distribution function. Our framework enables simple plug-in estimation of key prediction uncertainty metrics, including conditional mean squared prediction errors, conditional biases, and conditional quantiles, for random forests and many variants. Our approach is especially well-adapted for prediction interval estimation; we show via simulations that our proposed prediction intervals are competitive with, and in some settings outperform, existing methods. To establish theoretical grounding for our framework, we prove pointwise uniform consistency of a more stringent version of our estimator of the conditional prediction error distribution function. The estimators introduced here are implemented in the R package forestError.
Preference-based Online Learning with Dueling Bandits: A Survey
http://jmlr.org/papers/v22/18-546.html
2021Viktor Bengs, Róbert Busa-Fekete, Adil El Mesaoudi-Paul, Eyke Hüllermeier
In machine learning, the notion of multi-armed bandits refers to a class of online learning problems, in which an agent is supposed to simultaneously explore and exploit a given set of choice alternatives in the course of a sequential decision process. In the standard setting, the agent learns from stochastic feedback in the form of real-valued rewards. In many applications, however, numerical reward signals are not readily available---instead, only weaker information is provided, in particular relative preferences in the form of qualitative comparisons between pairs of alternatives. This observation has motivated the study of variants of the multi-armed bandit problem, in which more general representations are used both for the type of feedback to learn from and the target of prediction. The aim of this paper is to provide a survey of the state of the art in this field, referred to as preference-based multi-armed bandits or dueling bandits. To this end, we provide an overview of problems that have been considered in the literature as well as methods for tackling them. Our taxonomy is mainly based on the assumptions made by these methods about the data-generating process and, related to this, the properties of the preference-based feedback.
Consistent estimation of small masses in feature sampling
http://jmlr.org/papers/v22/18-534.html
2021Fadhel Ayed, Marco Battiston, Federico Camerlenghi, Stefano Favaro
Consider an (observable) random sample of size $n$ from an infinite population of individuals, each individual being endowed with a finite set of features from a collection of features $(F_{j})_{j\geq1}$ with unknown probabilities $(p_{j})_{j \geq 1}$, i.e., $p_{j}$ is the probability that an individual displays feature $F_{j}$. Under this feature sampling framework, in recent years there has been a growing interest in estimating the sum of the probability masses $p_{j}$'s of features observed with frequency $r\geq0$ in the sample, here denoted by $M_{n,r}$. This is the natural feature sampling counterpart of the classical problem of estimating small probabilities in the species sampling framework, where each individual is endowed with only one feature (or “species"). In this paper we study the problem of consistent estimation of the small mass $M_{n,r}$. We first show that there do not exist universally consistent estimators, in the multiplicative sense, of the missing mass $M_{n,0}$. Then, we introduce an estimator of $M_{n,r}$ and identify sufficient conditions under which the estimator is consistent. In particular, we propose a nonparametric estimator $\hat{M}_{n,r}$ of $M_{n,r}$ which has the same analytic form of the celebrated Good--Turing estimator for small probabilities, with the sole difference that the two estimators have different ranges (supports). Then, we show that $\hat{M}_{n,r}$ is strongly consistent, in the multiplicative sense, under the assumption that $(p_{j})_{j\geq1}$ has regularly varying heavy tails.
The Decoupled Extended Kalman Filter for Dynamic Exponential-Family Factorization Models
http://jmlr.org/papers/v22/18-417.html
2021Carlos A. Gomez-Uribe, Brian Karrer
Motivated by the needs of online large-scale recommender systems, we specialize the decoupled extended Kalman filter to factorization models, including factorization machines, matrix and tensor factorization, and illustrate the effectiveness of the approach through numerical experiments on synthetic and on real-world data. Online learning of model parameters through the decoupled extended Kalman filter makes factorization models more broadly useful by (i) allowing for more flexible observations through the entire exponential family, (ii) modeling parameter drift, and (iii) producing parameter uncertainty estimates that can enable explore/exploit and other applications. We use a different parameter dynamics than the standard decoupled extended Kalman filter, allowing parameter drift while encouraging reasonable values. We also present an alternate derivation of the extended Kalman filter and decoupled extended Kalman filter that highlights the role of the Fisher information matrix in the extended Kalman filter.
An Empirical Study of Bayesian Optimization: Acquisition Versus Partition
http://jmlr.org/papers/v22/18-220.html
2021Erich Merrill, Alan Fern, Xiaoli Fern, Nima Dolatnia
Bayesian optimization (BO) is a popular framework for black-box optimization. Two classes of BO approaches have shown promising empirical performance while providing strong theoretical guarantees. The first class optimizes an acquisition function to select points, which is typically computationally expensive and can only be done approximately. The second class of algorithms use systematic space partitioning, which is much cheaper computationally but the selection is typically less informed. This points to a potential trade-off between the computational complexity and empirical performance of these algorithms. The current literature, however, only provides a sparse sampling of empirical comparison points, giving little insight into this trade-off. The primary contribution of this work is to conduct a comprehensive, repeatable evaluation within a common software framework, which we provide as an open-source package. Our results give strong evidence about the relative performance of these methods and reveal a consistent top performer, even when accounting for overall computation time.
Regulating Greed Over Time in Multi-Armed Bandits
http://jmlr.org/papers/v22/17-720.html
2021Stefano Tracà, Cynthia Rudin, Weiyu Yan
In retail, there are predictable yet dramatic time-dependent patterns in customer behavior, such as periodic changes in the number of visitors, or increases in customers just before major holidays. The current paradigm of multi-armed bandit analysis does not take these known patterns into account. This means that for applications in retail, where prices are fixed for periods of time, current bandit algorithms will not suffice. This work provides a remedy that takes the time-dependent patterns into account, and we show how this remedy is implemented for the UCB, $\varepsilon$-greedy, and UCB-L algorithms, and also through a new policy called the variable arm pool algorithm. In the corrected methods, exploitation (greed) is regulated over time, so that more exploitation occurs during higher reward periods, and more exploration occurs in periods of low reward. In order to understand why regret is reduced with the corrected methods, we present a set of bounds that provide insight into why we would want to exploit during periods of high reward, and discuss the impact on regret. Our proposed methods perform well in experiments, and were inspired by a high-scoring entry in the Exploration and Exploitation 3 contest using data from Yahoo$!$ Front Page. That entry heavily used time-series methods to regulate greed over time, which was substantially more effective than other contextual bandit methods.
Domain Generalization by Marginal Transfer Learning
http://jmlr.org/papers/v22/17-679.html
2021Gilles Blanchard, Aniket Anand Deshmukh, Urun Dogan, Gyemin Lee, Clayton Scott
In the problem of domain generalization (DG), there are labeled training data sets from several related prediction problems, and the goal is to make accurate predictions on future unlabeled data sets that are not known to the learner. This problem arises in several applications where data distributions fluctuate because of environmental, technical, or other sources of variation. We introduce a formal framework for DG, and argue that it can be viewed as a kind of supervised learning problem by augmenting the original feature space with the marginal distribution of feature vectors. While our framework has several connections to conventional analysis of supervised learning algorithms, several unique aspects of DG require new methods of analysis. This work lays the learning theoretic foundations of domain generalization, building on our earlier conference paper where the problem of DG was introduced. We present two formal models of data generation, corresponding notions of risk, and distribution-free generalization error analysis. By focusing our attention on kernel methods, we also provide more quantitative results and a universally consistent algorithm. An efficient implementation is provided for this algorithm, which is experimentally compared to a pooling strategy on one synthetic and three real-world data sets.
On the Optimality of Kernel-Embedding Based Goodness-of-Fit Tests
http://jmlr.org/papers/v22/17-570.html
2021Krishnakumar Balasubramanian, Tong Li, Ming Yuan
The reproducing kernel Hilbert space (RKHS) embedding of distributions offers a general and flexible framework for testing problems in arbitrary domains and has attracted considerable amount of attention in recent years. To gain insights into their operating characteristics, we study here the statistical performance of such approaches within a minimax framework. Focusing on the case of goodness-of-fit tests, our analyses show that a vanilla version of the kernel embedding based test could be minimax suboptimal, {when considering $\chi^2$ distance as the separation metric}. Hence we suggest a simple remedy by moderating the embedding. We prove that the moderated approach provides optimal tests for a wide range of deviations from the null and can also be made adaptive over a large collection of interpolation spaces. Numerical experiments are presented to further demonstrate the merits of our approach.