http://www.jmlr.org
JMLR: Journal of Machine Learning Research
Rethinking Discount Regularization: New Interpretations, Unintended Consequences, and Solutions for Regularization in Reinforcement Learning
http://jmlr.org/papers/v25/24-0087.html
http://jmlr.org/papers/volume25/24-0087/24-0087.pdf
2024. Sarah Rathnam, Sonali Parbhoo, Siddharth Swaroop, Weiwei Pan, Susan A. Murphy, Finale Doshi-Velez
Discount regularization, using a shorter planning horizon when calculating the optimal policy, is a popular choice to avoid overfitting when faced with sparse or noisy data. It is commonly interpreted as de-emphasizing or ignoring delayed effects. In this paper, we prove two alternative views of discount regularization that expose unintended consequences and motivate novel regularization methods. In model-based RL, planning under a lower discount factor acts like a prior with stronger regularization on state-action pairs with more transition data. This leads to poor performance when the transition matrix is estimated from data sets with uneven amounts of data across state-action pairs. In model-free RL, discount regularization equates to planning using a weighted average Bellman update, where the agent plans as if the values of all state-action pairs are closer than implied by the data. Our equivalence theorems motivate simple methods that generalize discount regularization by setting parameters locally for individual state-action pairs rather than globally. We demonstrate the failures of discount regularization and how we remedy them using our state-action-specific methods across empirical examples with both tabular and continuous state spaces.
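As a rough illustration of what "planning under a lower discount factor" means in the tabular case, the sketch below runs plain value iteration on a hypothetical two-state MDP with two different discount factors. The MDP, the function name `value_iteration`, and the chosen discount values are assumptions made for this example; this is not the state-action-specific method proposed in the paper.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10, max_iter=10_000):
    """Tabular value iteration.

    P: (S, A, S) transition probabilities, R: (S, A) rewards.
    Returns the value function and the greedy policy.
    """
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        Q = R + gamma * (P @ V)      # (S, A, S) @ (S,) -> (S, A)
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = R + gamma * (P @ V)
    return V, Q.argmax(axis=1)

# Hypothetical MDP: in state 0, action 0 stays put for reward 1,
# action 1 moves to state 1 for reward 0. State 1 is absorbing with reward 2.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, :, 1] = 1.0
R = np.array([[1.0, 0.0],
              [2.0, 2.0]])

V_short, pi_short = value_iteration(P, R, gamma=0.3)  # regularized (short horizon)
V_long, pi_long = value_iteration(P, R, gamma=0.9)    # evaluation horizon
print(pi_short[0], pi_long[0])  # prints "0 1"
```

In this toy problem the delayed reward only pays off when the discount factor exceeds 0.5, so the shorter horizon picks the myopic action in state 0 while the longer horizon does not.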
PromptBench: A Unified Library for Evaluation of Large Language Models
http://jmlr.org/papers/v25/24-0023.html
http://jmlr.org/papers/volume25/24-0023/24-0023.pdf
2024. Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, Xing Xie
The evaluation of large language models (LLMs) is crucial to assess their performance and mitigate potential security risks. In this paper, we introduce PromptBench, a unified library to evaluate LLMs. It consists of several key components that can be easily used and extended by researchers: prompt construction, prompt engineering, dataset and model loading, adversarial prompt attack, dynamic evaluation protocols, and analysis tools. PromptBench is designed as an open, general, and flexible codebase for research purposes. It aims to facilitate original study in creating new benchmarks, deploying downstream applications, and designing new evaluation protocols. The code is available at: https://github.com/microsoft/promptbench and will be continuously supported.
Gaussian Interpolation Flows
http://jmlr.org/papers/v25/23-1515.html
http://jmlr.org/papers/volume25/23-1515/23-1515.pdf
2024. Yuan Gao, Jian Huang, Yuling Jiao
Gaussian denoising has emerged as a powerful method for constructing simulation-free continuous normalizing flows for generative modeling. Despite their empirical successes, theoretical properties of these flows and the regularizing effect of Gaussian denoising have remained largely unexplored. In this work, we aim to address this gap by investigating the well-posedness of simulation-free continuous normalizing flows built on Gaussian denoising. Through a unified framework termed Gaussian interpolation flow, we establish the Lipschitz regularity of the flow velocity field, the existence and uniqueness of the flow, and the Lipschitz continuity of the flow map and the time-reversed flow map for several rich classes of target distributions. This analysis also sheds light on the auto-encoding and cycle consistency properties of Gaussian interpolation flows. Additionally, we study the stability of these flows in source distributions and perturbations of the velocity field, using the quadratic Wasserstein distance as a metric. Our findings offer valuable insights into the learning techniques employed in Gaussian interpolation flows for generative modeling, providing a solid theoretical foundation for end-to-end error analyses of learning Gaussian interpolation flows with empirical observations.
Gaussian Mixture Models with Rare Events
http://jmlr.org/papers/v25/23-1245.html
http://jmlr.org/papers/volume25/23-1245/23-1245.pdf
2024. Xuetong Li, Jing Zhou, Hansheng Wang
We study here a Gaussian mixture model (GMM) with rare events data. In this case, the commonly used Expectation-Maximization (EM) algorithm exhibits an extremely slow numerical convergence rate. To theoretically understand this phenomenon, we formulate the numerical convergence problem of the EM algorithm with rare events data as a problem about a contraction operator. Theoretical analysis reveals that the spectral radius of the contraction operator in this case could be arbitrarily close to 1 asymptotically. This theoretical finding explains the empirical slow numerical convergence of the EM algorithm with rare events data. To overcome this challenge, a Mixed EM (MEM) algorithm is developed, which utilizes the information provided by partially labeled data. As compared with the standard EM algorithm, the key feature of the MEM algorithm is that it additionally requires labeled data. We find that the MEM algorithm significantly improves the numerical convergence rate as compared with the standard EM algorithm. The finite sample performance of the proposed method is illustrated by both simulation studies and a real-world dataset of Swedish traffic signs.
On the Concentration of the Minimizers of Empirical Risks
http://jmlr.org/papers/v25/23-1149.html
http://jmlr.org/papers/volume25/23-1149/23-1149.pdf
2024. Paul Escande
Obtaining guarantees on the convergence of the minimizers of empirical risks to the ones of the true risk is a fundamental matter in statistical learning. Instead of deriving guarantees on the usual estimation error, the goal of this paper is to provide concentration inequalities on the distance between the sets of minimizers of the risks for a broad spectrum of estimation problems. In particular, the risks are defined on metric spaces through probability measures that are also supported on metric spaces. Particular attention is therefore given to including unbounded spaces and non-convex cost functions that might also be unbounded. This work identifies a set of high-level assumptions that describe a regime that seems to govern the concentration in many estimation problems, where the empirical minimizers are stable. This stability can then be leveraged to prove parametric concentration rates in probability and in expectation. The assumptions are verified, and the bounds showcased, on a selection of estimation problems such as barycenters on metric spaces with positive or negative curvature, subspaces of covariance matrices, regression problems and entropic-Wasserstein barycenters.
Variance estimation in graphs with the fused lasso
http://jmlr.org/papers/v25/23-1061.html
http://jmlr.org/papers/volume25/23-1061/23-1061.pdf
2024. Oscar Hernan Madrid Padilla
We study the problem of variance estimation in general graph-structured problems. First, we develop a linear time estimator for the homoscedastic case that can consistently estimate the variance in general graphs. We show that our estimator attains minimax rates for the chain and 2D grid graphs when the mean signal has total variation with canonical scaling. Furthermore, we provide general upper bounds on the mean squared error performance of the fused lasso estimator in general graphs under a moment condition and a bound on the tail behavior of the errors. These upper bounds allow us to generalize, to broader classes of distributions such as sub-exponential ones, many existing results on the fused lasso that are only known to hold under the assumption that the errors are sub-Gaussian random variables. Exploiting our upper bounds, we then study a simple total variation regularization estimator for estimating the signal of variances in the heteroscedastic case. We also provide lower bounds showing that our heteroscedastic variance estimator attains minimax rates for estimating signals of bounded variation in grid graphs and $K$-nearest neighbor graphs, and the estimator is consistent for estimating the variances in any connected graph.
Random measure priors in Bayesian recovery from sketches
http://jmlr.org/papers/v25/23-1058.html
http://jmlr.org/papers/volume25/23-1058/23-1058.pdf
2024. Mario Beraha, Stefano Favaro, Matteo Sesia
This paper introduces a Bayesian nonparametric approach to frequency recovery from lossy-compressed discrete data, leveraging all information contained in a sketch obtained through random hashing. By modeling the data points as random samples from an unknown discrete distribution endowed with a Poisson-Kingman prior, we derive the posterior distribution of a symbol's empirical frequency given the sketch. This leads to principled frequency estimates through mean functionals, e.g., the posterior mean, median and mode. We highlight applications of this general result to Dirichlet process and Pitman-Yor process priors. Notably, we prove that the former prior uniquely satisfies a sufficiency property that simplifies the posterior distribution, while the latter enables a convenient large-sample asymptotic approximation. Additionally, we extend our approach to the problem of cardinality recovery, estimating the number of distinct symbols in the sketched dataset. Our approach to frequency recovery also adapts to a more general “traits” setting, where each data point has integer levels of association with multiple symbols, typically referred to as “traits”. By employing a generalized Indian buffet process, we compute the posterior distribution of a trait's frequency using both the Poisson and Bernoulli distributions for the trait association levels, respectively yielding exact and approximate posterior frequency distributions.
From continuous-time formulations to discretization schemes: tensor trains and robust regression for BSDEs and parabolic PDEs
http://jmlr.org/papers/v25/23-0982.html
http://jmlr.org/papers/volume25/23-0982/23-0982.pdf
2024. Lorenz Richter, Leon Sallandt, Nikolas Nüsken
The numerical approximation of partial differential equations (PDEs) poses formidable challenges in high dimensions since classical grid-based methods suffer from the so-called curse of dimensionality. Recent attempts rely on a combination of Monte Carlo methods and variational formulations, using neural networks for function approximation. Extending previous work (Richter et al., 2021), we argue that tensor trains provide an appealing framework for parabolic PDEs: The combination of reformulations in terms of backward stochastic differential equations and regression-type methods holds the promise of leveraging latent low-rank structures, enabling both compression and efficient computation. Emphasizing a continuous-time viewpoint, we develop iterative schemes, which differ in terms of computational efficiency and robustness. We demonstrate both theoretically and numerically that our methods can achieve a favorable trade-off between accuracy and computational efficiency. While previous methods have been either accurate or fast, we have identified a novel numerical strategy that can often combine both of these aspects.
Label Alignment Regularization for Distribution Shift
http://jmlr.org/papers/v25/23-0899.html
http://jmlr.org/papers/volume25/23-0899/23-0899.pdf
2024. Ehsan Imani, Guojun Zhang, Runjia Li, Jun Luo, Pascal Poupart, Philip H.S. Torr, Yangchen Pan
Recent work has highlighted the label alignment property (LAP) in supervised learning, where the vector of all labels in the dataset is mostly in the span of the top few singular vectors of the data matrix. Drawing inspiration from this observation, we propose a regularization method for unsupervised domain adaptation that encourages alignment between the predictions in the target domain and its top singular vectors. Unlike conventional domain adaptation approaches that focus on regularizing representations, we instead regularize the classifier to align with the unsupervised target data, guided by the LAP in both the source and target domains. Theoretical analysis demonstrates that, under certain assumptions, our solution resides within the span of the top right singular vectors of the target domain data and aligns with the optimal solution. By removing the reliance on the commonly used optimal joint risk assumption found in classic domain adaptation theory, we showcase the effectiveness of our method in addressing problems where traditional domain adaptation methods often fall short due to high joint error. Additionally, we report improved performance over domain adaptation baselines in well-known tasks such as MNIST-USPS domain adaptation and cross-lingual sentiment analysis. An implementation is available at https://github.com/EhsanEI/lar/.
Fairness in Survival Analysis with Distributionally Robust Optimization
http://jmlr.org/papers/v25/23-0888.html
http://jmlr.org/papers/volume25/23-0888/23-0888.pdf
2024. Shu Hu, George H. Chen
We propose a general approach for encouraging fairness in survival analysis models that is based on minimizing a worst-case error across all subpopulations that are “large enough” (occurring with at least a user-specified probability threshold). This approach can be used to convert a wide variety of existing survival analysis models into ones that simultaneously encourage fairness, without requiring the user to specify which attributes or features to treat as sensitive in the training loss function. From a technical standpoint, our approach applies recent methodological developments of distributionally robust optimization (DRO) to survival analysis. The complication is that existing DRO theory uses a training loss function that decomposes across contributions of individual data points, i.e., any term that shows up in the loss function depends only on a single training point. This decomposition does not hold for commonly used survival loss functions, including for the standard Cox proportional hazards model, its deep neural network variants, and many other recently developed survival analysis models that use loss functions involving ranking or similarity score calculations. We address this technical hurdle using a sample splitting strategy. We demonstrate our sample splitting DRO approach by using it to create fair versions of a diverse set of existing survival analysis models including the classical Cox model (and its deep neural network variant DeepSurv), the discrete-time model DeepHit, and the neural ODE model SODEN. We also establish a finite-sample theoretical guarantee to show what our sample splitting DRO loss converges to. Specifically for the Cox model, we further derive an exact DRO approach that does not use sample splitting. For all the survival models that we convert into DRO variants, we show that the DRO variants often score better on recently established fairness metrics (without incurring a significant drop in accuracy) compared to existing survival analysis fairness regularization techniques, including ones which directly use sensitive demographic information in their training loss functions.
FineMorphs: Affine-Diffeomorphic Sequences for Regression
http://jmlr.org/papers/v25/23-0824.html
http://jmlr.org/papers/volume25/23-0824/23-0824.pdf
2024. Michele Lohr, Laurent Younes
A multivariate regression model of affine and diffeomorphic transformation sequences, FineMorphs, is presented. Leveraging concepts from shape analysis, model states are optimally "reshaped" by diffeomorphisms generated by smooth vector fields during learning. Affine transformations and vector fields are optimized within an optimal control setting, and the model can naturally reduce (or increase) dimensionality and adapt to large data sets via sub-optimal vector fields. A proof of existence of a solution and necessary conditions for optimality are derived for the model. Experimental results on real data sets from the UCI repository are presented, with favorable results in comparison with the state of the art in the literature, neural ordinary differential equation models, and densely-connected neural networks in TensorFlow.
Tensor-train methods for sequential state and parameter learning in state-space models
http://jmlr.org/papers/v25/23-0743.html
http://jmlr.org/papers/volume25/23-0743/23-0743.pdf
2024. Yiran Zhao, Tiangang Cui
We consider sequential state and parameter learning in state-space models with intractable state transition and observation processes. By exploiting low-rank tensor train (TT) decompositions, we propose new sequential learning methods for joint parameter and state estimation under the Bayesian framework. Our key innovation is the introduction of scalable function approximation tools such as TT for recursively learning the sequentially updated posterior distributions. The function approximation perspective of our methods offers tractable error analysis and potentially alleviates the particle degeneracy faced by many particle-based methods. In addition to the new insights into the algorithmic design, our methods complement conventional particle-based methods. Our TT-based approximations naturally define conditional Knothe-Rosenblatt (KR) rearrangements that lead to parameter estimation, filtering, smoothing and path estimation accompanying our sequential learning algorithms, which open the door to removing potential approximation bias. We also explore several preconditioning techniques based on either linear or nonlinear KR rearrangements to enhance the approximation power of TT for practical problems. We demonstrate the efficacy and efficiency of our proposed methods on several state-space models, in which our methods achieve state-of-the-art estimation accuracy and computational performance.
Memory of recurrent networks: Do we compute it right?
http://jmlr.org/papers/v25/23-0568.html
http://jmlr.org/papers/volume25/23-0568/23-0568.pdf
2024. Giovanni Ballarin, Lyudmila Grigoryeva, Juan-Pablo Ortega
Numerical evaluations of the memory capacity (MC) of recurrent neural networks reported in the literature often contradict well-established theoretical bounds. In this paper, we study the case of linear echo state networks, for which the total memory capacity has been proven to be equal to the rank of the corresponding Kalman controllability matrix. We shed light on various reasons for the inaccurate numerical estimations of the memory, and we show that these issues, often overlooked in the recent literature, are of an exclusively numerical nature. More explicitly, we prove that when the Krylov structure of the linear MC is ignored, a gap between the theoretical MC and its empirical counterpart is introduced. As a solution, we develop robust numerical approaches by exploiting a result of MC neutrality with respect to the input mask matrix. Simulations show that the memory curves that are recovered using the proposed methods fully agree with the theory.
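To make the theoretical quantity concrete, the sketch below builds the Kalman controllability matrix $[C, AC, \dots, A^{n-1}C]$ for a linear state recursion $x_t = A x_{t-1} + C z_t$ and reports its rank, which the abstract identifies with the total memory capacity. The function name `controllability_rank` and the small diagonal examples are assumptions for illustration. The naive rank computation shown here is exactly the kind of step that can become numerically fragile for larger, ill-conditioned Krylov matrices; it is not the robust procedure developed in the paper.

```python
import numpy as np

def controllability_rank(A, C):
    """Rank of the Kalman controllability matrix [C, AC, ..., A^(n-1) C]."""
    n = A.shape[0]
    blocks = [C]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])   # next Krylov block
    K = np.column_stack(blocks)
    return int(np.linalg.matrix_rank(K))

C = np.ones(3)                                   # hypothetical input mask
A_distinct = np.diag([0.1, 0.5, 0.9])            # distinct eigenvalues
A_repeated = np.diag([0.5, 0.5, 0.9])            # one repeated eigenvalue

rank_distinct = controllability_rank(A_distinct, C)
rank_repeated = controllability_rank(A_repeated, C)
print(rank_distinct, rank_repeated)  # prints "3 2"
```

With a repeated eigenvalue, two rows of the controllability matrix coincide, so the rank (and hence the theoretical memory capacity in this toy case) drops from 3 to 2.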
The Loss Landscape of Deep Linear Neural Networks: a Second-order Analysis
http://jmlr.org/papers/v25/23-0493.html
http://jmlr.org/papers/volume25/23-0493/23-0493.pdf
2024. El Mehdi Achour, François Malgouyres, Sébastien Gerchinovitz
We study the optimization landscape of deep linear neural networks with square loss. It is known that, under weak assumptions, there are no spurious local minima and no local maxima. However, the existence and diversity of non-strict saddle points, which can play a role in first-order algorithms' dynamics, have only been lightly studied. We go a step further with a complete analysis of the optimization landscape at order $2$. Among all critical points, we characterize global minimizers, strict saddle points, and non-strict saddle points. We enumerate all the associated critical values. The characterization is simple, involves conditions on the ranks of partial matrix products, and sheds some light on global convergence or implicit regularization that has been proved or observed when optimizing linear neural networks. In passing, we provide an explicit parameterization of the set of all global minimizers and exhibit large sets of strict and non-strict saddle points.
High Probability Convergence Bounds for Non-convex Stochastic Gradient Descent with Sub-Weibull Noise
http://jmlr.org/papers/v25/23-0466.html
http://jmlr.org/papers/volume25/23-0466/23-0466.pdf
2024. Liam Madden, Emiliano Dall'Anese, Stephen Becker
Stochastic gradient descent is one of the most common iterative algorithms used in machine learning and its convergence analysis is a rich area of research. Understanding its convergence properties can help inform what modifications of it to use in different settings. However, most theoretical results either assume convexity or only provide convergence results in mean. This paper, on the other hand, proves convergence bounds in high probability without assuming convexity. Assuming strong smoothness, we prove high probability convergence bounds in two settings: (1) assuming the Polyak-Łojasiewicz inequality and norm sub-Gaussian gradient noise and (2) assuming norm sub-Weibull gradient noise. In the second setting, as an intermediate step to proving convergence, we prove a sub-Weibull martingale difference sequence self-normalized concentration inequality of independent interest. It extends Freedman-type concentration beyond the sub-exponential threshold to heavier-tailed martingale difference sequences. We also provide a post-processing method that picks a single iterate with a provable convergence guarantee as opposed to the usual bound for the unknown best iterate. Our convergence result for sub-Weibull noise extends the regime where stochastic gradient descent has equal or better convergence guarantees than stochastic gradient descent with modifications such as clipping, momentum, and normalization.
Euler Characteristic Tools for Topological Data Analysis
http://jmlr.org/papers/v25/23-0353.html
http://jmlr.org/papers/volume25/23-0353/23-0353.pdf
2024. Olympio Hacquard, Vadim Lebovici
In this article, we study Euler characteristic techniques in topological data analysis. Pointwise computing the Euler characteristic of a family of simplicial complexes built from data gives rise to the so-called Euler characteristic profile. We show that this simple descriptor achieves state-of-the-art performance in supervised tasks at a meagre computational cost. Inspired by signal analysis, we compute hybrid transforms of Euler characteristic profiles. These integral transforms mix Euler characteristic techniques with Lebesgue integration to provide highly efficient compressors of topological signals. As a consequence, they show remarkable performance in unsupervised settings. On the qualitative side, we provide numerous heuristics on the topological and geometric information captured by Euler profiles and their hybrid transforms. Finally, we prove stability results for these descriptors as well as asymptotic guarantees in random settings.
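The idea of an Euler characteristic profile can be sketched on a tiny point cloud. The example below assumes a Vietoris-Rips filtration in which a simplex enters at its largest pairwise distance, and it enumerates every subset of points by brute force, so it only suits very small inputs. The function name `euler_characteristic_profile` and the unit-square data are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations
import numpy as np

def euler_characteristic_profile(points, radii):
    """Euler characteristic of the Rips complex at each scale in `radii`.

    A simplex is present at scale r when all its pairwise distances are <= r.
    Brute force over all subsets: exponential in the number of points.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    simplices = []  # (filtration value, sign) with sign = (-1)^dimension
    for k in range(1, n + 1):
        for s in combinations(range(n), k):
            diam = max((dist[i, j] for i, j in combinations(s, 2)), default=0.0)
            simplices.append((diam, (-1) ** (k - 1)))
    return [sum(sign for diam, sign in simplices if diam <= r) for r in radii]

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
profile = euler_characteristic_profile(square, radii=[0.5, 1.0, 1.5])
print(profile)  # prints "[4, 0, 1]"
```

The three values correspond to four isolated points, a hollow 4-cycle (four vertices minus four edges), and a filled tetrahedron that is contractible.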
Depth Degeneracy in Neural Networks: Vanishing Angles in Fully Connected ReLU Networks on Initialization
http://jmlr.org/papers/v25/23-0350.html
http://jmlr.org/papers/volume25/23-0350/23-0350.pdf
2024. Cameron Jakub, Mihai Nica
Despite remarkable performance on a variety of tasks, many properties of deep neural networks are not yet theoretically understood. One such mystery is the depth degeneracy phenomenon: the deeper you make your network, the closer your network is to a constant function on initialization. In this paper, we examine the evolution of the angle between two inputs to a ReLU neural network as a function of the number of layers. By using combinatorial expansions, we find precise formulas for how fast this angle goes to zero as depth increases. These formulas capture microscopic fluctuations that are not visible in the popular framework of infinite width limits, and lead to qualitatively different predictions. We validate our theoretical results with Monte Carlo experiments and show that our results accurately approximate finite network behaviour. We also empirically investigate how the depth degeneracy phenomenon can negatively impact training of real networks. The formulas are given in terms of the mixed moments of correlated Gaussians passed through the ReLU function. We also find a surprising combinatorial connection between these mixed moments and the Bessel numbers that allows us to explicitly evaluate these moments.
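The shrinking angle can be observed with a small Monte Carlo sketch. The code below assumes a bias-free fully connected ReLU network with Gaussian weights of variance 2/fan-in (He-style initialization), and tracks the angle between two initially orthogonal inputs layer by layer. The width, depth, seed, and function names are assumptions chosen for illustration; the paper's formulas are not reproduced here.

```python
import numpy as np

def angle(u, v):
    """Angle in radians between two nonzero vectors."""
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

def angles_through_depth(x, y, depth, width, rng):
    """Angle between the two hidden representations after each random ReLU layer."""
    out = [angle(x, y)]
    for _ in range(depth):
        fan_in = x.shape[0]
        W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(width, fan_in))
        x, y = np.maximum(W @ x, 0.0), np.maximum(W @ y, 0.0)
        out.append(angle(x, y))
    return out

rng = np.random.default_rng(0)
x = np.zeros(256); x[0] = 1.0   # two orthogonal inputs, initial angle pi/2
y = np.zeros(256); y[1] = 1.0
angles = angles_through_depth(x, y, depth=30, width=256, rng=rng)
print(angles[0], angles[-1])
```

Plotting `angles` against the layer index gives an empirical curve that could be compared with the decay rates described in the abstract.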
Fortuna: A Library for Uncertainty Quantification in Deep Learning
http://jmlr.org/papers/v25/23-0145.html
http://jmlr.org/papers/volume25/23-0145/23-0145.pdf
2024. Gianluca Detommaso, Alberto Gasparin, Michele Donini, Matthias Seeger, Andrew Gordon Wilson, Cedric Archambeau
We present Fortuna, an open-source library for uncertainty quantification in deep learning. Fortuna supports a range of calibration techniques, such as conformal prediction that can be applied to any trained neural network to generate reliable uncertainty estimates, and scalable Bayesian inference methods that can be applied to deep neural networks trained from scratch for improved uncertainty quantification and accuracy. By providing a coherent framework for advanced uncertainty quantification methods, Fortuna simplifies the process of benchmarking and helps practitioners build robust AI systems.
Characterization of translation invariant MMD on $\mathbb{R}^d$ and connections with Wasserstein distances
http://jmlr.org/papers/v25/22-1338.html
http://jmlr.org/papers/volume25/22-1338/22-1338.pdf
2024. Thibault Modeste, Clément Dombry
Kernel mean embeddings and maximum mean discrepancies (MMD) associated with positive definite kernels are important tools in machine learning that allow comparing probability measures and sample distributions. We provide a full characterization of translation invariant MMDs on $\mathbb{R}^d$ that are parametrized by a spectral measure and a positive semi-definite symmetric matrix. Furthermore, we investigate the connections between translation invariant MMDs and Wasserstein distances on $\mathbb{R}^d$. We show in particular that convergence with respect to the MMD associated with the Energy Kernel of order $\alpha\in(0,1)$ implies convergence with respect to the Wasserstein distance of order $\beta<\alpha$. We also provide examples of kernels metrizing the Wasserstein space of order $\alpha\geq 1$. A short numerical experiment illustrates our findings in the framework of the one-sample-test.
On the Hyperparameters in Stochastic Gradient Descent with Momentum
http://jmlr.org/papers/v25/22-1189.html
http://jmlr.org/papers/volume25/22-1189/22-1189.pdf
2024. Bin Shi
Following the same routine as Shi et al. (2023), we continue to present the theoretical analysis for stochastic gradient descent with momentum (SGD with momentum) in this paper. In contrast to that work, for SGD with momentum we demonstrate that the two hyperparameters together, the learning rate and the momentum coefficient, play a significant role in the linear convergence rate in non-convex optimization. Our analysis is based on a hyperparameter-dependent stochastic differential equation (hp-dependent SDE) that serves as a continuous surrogate for SGD with momentum. Similarly, we establish the linear convergence for the continuous-time formulation of SGD with momentum and obtain an explicit expression for the optimal linear rate by analyzing the spectrum of the Kramers-Fokker-Planck operator. By comparison, we demonstrate how the optimal linear rate of convergence and the final gap, which for SGD depend only on the learning rate, vary as the momentum coefficient increases from zero to one once momentum is introduced. Then, we propose a mathematical interpretation of why, in practice, SGD with momentum converges faster and is more robust to the learning rate than standard stochastic gradient descent (SGD). Finally, we show that the Nesterov momentum in the presence of noise has no essential difference from the traditional momentum.
Improved Random Features for Dot Product Kernels
http://jmlr.org/papers/v25/22-0118.html
http://jmlr.org/papers/volume25/22-0118/22-0118.pdf
2024. Jonas Wacker, Motonobu Kanagawa, Maurizio Filippone
Dot product kernels, such as polynomial and exponential (softmax) kernels, are among the most widely used kernels in machine learning, as they enable modeling the interactions between input features, which is crucial in applications like computer vision, natural language processing, and recommender systems. We make several novel contributions for improving the efficiency of random feature approximations for dot product kernels, to make these kernels more useful in large scale learning. First, we present a generalization of existing random feature approximations for polynomial kernels, such as Rademacher and Gaussian sketches and TensorSRHT, using complex-valued random features. We show empirically that the use of complex features can significantly reduce the variances of these approximations. Second, we provide a theoretical analysis for understanding the factors affecting the efficiency of various random feature approximations, by deriving closed-form expressions for their variances. These variance formulas elucidate conditions under which certain approximations (e.g., TensorSRHT) achieve lower variances than others (e.g., Rademacher sketches), and conditions under which the use of complex features leads to lower variances than real features. Third, by using these variance formulas, which can be evaluated in practice, we develop a data-driven optimization approach to improve random feature approximations for general dot product kernels, which is also applicable to the Gaussian kernel. We describe the improvements brought by these contributions with extensive experiments on a variety of tasks and datasets.
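For orientation, a real-valued Rademacher sketch for the homogeneous polynomial kernel $(x^\top y)^p$ can be written in a few lines: each feature is a product of $p$ independent Rademacher projections, which makes the inner product of feature vectors an unbiased estimate of the kernel. This is a minimal baseline sketch under assumed names (`rademacher_poly_features`) and toy data; it does not implement the complex-valued features, TensorSRHT, or the data-driven optimization described in the abstract.

```python
import numpy as np

def rademacher_poly_features(X, degree, num_features, rng):
    """Random features Z such that Z @ Z.T approximates (X @ X.T) ** degree."""
    n, d = X.shape
    Z = np.ones((n, num_features))
    for _ in range(degree):
        # Fresh independent +/-1 projection matrix for each factor of the product.
        W = rng.choice([-1.0, 1.0], size=(d, num_features))
        Z *= X @ W
    return Z / np.sqrt(num_features)

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs

Z = rademacher_poly_features(X, degree=2, num_features=50_000, rng=rng)
K_approx = Z @ Z.T
K_exact = (X @ X.T) ** 2
max_abs_error = float(np.max(np.abs(K_approx - K_exact)))
print(max_abs_error)
```

The error shrinks at the usual Monte Carlo rate in the number of features, and the variance of each entry is the quantity that closed-form variance expressions of the kind mentioned in the abstract would describe.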
Regret Analysis of Bilateral Trade with a Smoothed Adversary
http://jmlr.org/papers/v25/23-1627.html
http://jmlr.org/papers/volume25/23-1627/23-1627.pdf
2024. Nicolò Cesa-Bianchi, Tommaso Cesari, Roberto Colomboni, Federico Fusco, Stefano Leonardi
We study repeated bilateral trade where an adaptive $\sigma$-smooth adversary generates the valuations of sellers and buyers. We completely characterize the regret regimes for fixed-price mechanisms under different feedback models in the two cases where the learner can post the same or different prices to buyers and sellers. We begin by showing that, in the full-feedback scenario, the minimax regret after $T$ rounds is of order $\sqrt{T}$. Under partial feedback, any algorithm that has to post the same price to buyers and sellers suffers worst-case linear regret. However, when the learner can post two different prices at each round, we design an algorithm enjoying regret of order $T^{3/4}$, ignoring log factors. We prove that this rate is optimal by presenting a surprising $T^{3/4}$ lower bound, which is the paper's main technical contribution.
Invariant Physics-Informed Neural Networks for Ordinary Differential Equations
http://jmlr.org/papers/v25/23-1511.html
http://jmlr.org/papers/volume25/23-1511/23-1511.pdf
2024. Shivam Arora, Alex Bihlo, Francis Valiquette
Physics-informed neural networks have emerged as a prominent new method for solving differential equations. While conceptually straightforward, they often suffer from training difficulties that lead to relatively large discretization errors or the failure to obtain correct solutions. In this paper we introduce invariant physics-informed neural networks for ordinary differential equations that admit a finite-dimensional group of Lie point symmetries. Using the method of equivariant moving frames, a differential equation is invariantized to obtain a generally simpler equation in the space of differential invariants. A solution to the invariantized equation is then mapped back to a solution of the original differential equation by solving the reconstruction equations for the left moving frame. The invariantized differential equation together with the reconstruction equations are solved using a physics-informed neural network, and form what we call an invariant physics-informed neural network. We illustrate the method with several examples, all of which considerably outperform standard non-invariant physics-informed neural networks.
Distribution Learning via Neural Differential Equations: A Nonparametric Statistical Perspective
http://jmlr.org/papers/v25/23-1280.html
http://jmlr.org/papers/volume25/23-1280/23-1280.pdf
2024Youssef Marzouk, Zhi (Robert) Ren, Sven Wang, Jakob Zech
Ordinary differential equations (ODEs), via their induced flow maps, provide a powerful framework to parameterize invertible transformations for representing complex probability distributions. While such models have achieved enormous success in machine learning, little is known about their statistical properties. This work establishes the first general nonparametric statistical convergence analysis for distribution learning via ODE models trained through likelihood maximization. We first prove a convergence theorem applicable to arbitrary velocity field classes $\mathcal{F}$ satisfying certain simple boundary constraints. This general result captures the trade-off between the approximation error and complexity of the ODE model. We show that the latter can be quantified via the $C^1$-metric entropy of the class $\mathcal{F}$. We then apply this general framework to the setting of $C^k$-smooth target densities, and establish nearly minimax-optimal convergence rates for two relevant velocity field classes $\mathcal{F}$: $C^k$ functions and neural networks. The latter is the practically important case of neural ODEs. Our results also provide insight on how the choice of velocity field class, and the dependence of this choice on sample size (e.g., the scaling of neural network classes), impact statistical performance.
Variation Spaces for Multi-Output Neural Networks: Insights on Multi-Task Learning and Network Compression
http://jmlr.org/papers/v25/23-0677.html
http://jmlr.org/papers/volume25/23-0677/23-0677.pdf
2024Joseph Shenouda, Rahul Parhi, Kangwook Lee, Robert D. Nowak
This paper introduces a novel theoretical framework for the analysis of vector-valued neural networks through the development of vector-valued variation spaces, a new class of reproducing kernel Banach spaces. These spaces emerge from studying the regularization effect of weight decay in training networks with activation functions like the rectified linear unit (ReLU). This framework offers a deeper understanding of multi-output networks and their function-space characteristics. A key contribution of this work is the development of a representer theorem for the vector-valued variation spaces. This representer theorem establishes that shallow vector-valued neural networks are the solutions to data-fitting problems over these infinite-dimensional spaces, where the network widths are bounded by the square of the number of training data. This observation reveals that the norm associated with these vector-valued variation spaces encourages the learning of features that are useful for multiple tasks, shedding new light on multi-task learning with neural networks. Finally, this paper develops a connection between weight-decay regularization and the multi-task lasso problem. This connection leads to novel bounds for layer widths in deep networks that depend on the intrinsic dimensions of the training data representations. This insight not only deepens the understanding of the deep network architectural requirements, but also yields a simple convex optimization method for deep neural network compression. The performance of this compression procedure is evaluated on various architectures.
Individual-centered Partial Information in Social Networks
http://jmlr.org/papers/v25/23-0005.html
http://jmlr.org/papers/volume25/23-0005/23-0005.pdf
2024Xiao Han, Y. X. Rachel Wang, Qing Yang, Xin Tong
In statistical network analysis, we often assume either the full network is available or multiple subgraphs can be sampled to estimate various global properties of the network. However, in a real social network, people frequently make decisions based on their local view of the network alone. Here, we consider a partial information framework that characterizes the local network centered at a given individual by path length $L$ and gives rise to a partial adjacency matrix. Under $L=2$, we focus on the problem of (global) community detection using the popular stochastic block model (SBM) and its degree-corrected variant (DCSBM). We derive theoretical properties of the eigenvalues and eigenvectors from the signal term of the partial adjacency matrix and propose new spectral-based community detection algorithms that achieve consistency under appropriate conditions. Our analysis also allows us to propose a new centrality measure that assesses the importance of an individual's partial information in determining global community structure. Using simulated and real networks, we demonstrate the performance of our algorithms and compare our centrality measure with other popular alternatives to show it captures unique nodal information. Our results illustrate that the partial information framework enables us to compare the viewpoints of different individuals regarding the global structure.
Data-driven Automated Negative Control Estimation (DANCE): Search for, Validation of, and Causal Inference with Negative Controls
http://jmlr.org/papers/v25/22-1062.html
http://jmlr.org/papers/volume25/22-1062/22-1062.pdf
2024Erich Kummerfeld, Jaewon Lim, Xu Shi
Negative control variables are increasingly used to adjust for unmeasured confounding bias in causal inference using observational data. They are typically identified by subject matter knowledge and there is currently a severe lack of data-driven methods to find negative controls. In this paper, we present a statistical test for discovering negative controls of a special type---disconnected negative controls---that can serve as surrogates of the unmeasured confounder, and we incorporate that test into the Data-driven Automated Negative Control Estimation (DANCE) algorithm. DANCE first uses the new validation test to identify subsets of a set of candidate negative control variables that satisfy the assumptions of disconnected negative controls. It then applies a negative control method to each pair of these validated negative control variables, and aggregates the output to produce an unbiased point estimate and confidence interval for a causal effect in the presence of unmeasured confounding. We (1) prove the correctness of this validation test, and thus of DANCE; (2) demonstrate via simulation experiments that DANCE outperforms both a naive analysis ignoring unmeasured confounding and a negative control method with randomly selected candidate negative controls; and (3) demonstrate the effectiveness of DANCE on a challenging real-world problem.
Continuous Prediction with Experts' Advice
http://jmlr.org/papers/v25/22-0803.html
http://jmlr.org/papers/volume25/22-0803/22-0803.pdf
2024Nicholas J. A. Harvey, Christopher Liaw, Victor S. Portella
Prediction with experts' advice is one of the most fundamental problems in online learning and captures many of its technical challenges. A recent line of work has looked at online learning through the lens of differential equations and continuous-time analysis. This viewpoint has yielded optimal results for several problems in online learning.
In this paper, we employ continuous-time stochastic calculus in order to study the discrete-time experts' problem. We use these tools to design a continuous-time, parameter-free algorithm with improved guarantees on the quantile regret. We then develop an analogous discrete-time algorithm with a very similar analysis and identical quantile regret bounds. Finally, we design an anytime continuous-time algorithm with regret matching the optimal fixed-time rate when the gains are independent Brownian motions; in many settings, this is the most difficult case. This gives some evidence that, even with adversarial gains, the optimal anytime and fixed-time regrets may coincide.
Memory-Efficient Sequential Pattern Mining with Hybrid Tries
http://jmlr.org/papers/v25/22-0125.html
http://jmlr.org/papers/volume25/22-0125/22-0125.pdf
2024Amin Hosseininasab, Willem-Jan van Hoeve, Andre A. Cire
This paper develops a memory-efficient approach for Sequential Pattern Mining (SPM), a fundamental topic in knowledge discovery that faces a well-known memory bottleneck for large data sets. Our methodology involves a novel hybrid trie data structure that exploits recurring patterns to compactly store the data set in memory; and a corresponding mining algorithm designed to effectively extract patterns from this compact representation. Numerical results on small to medium-sized real-life test instances show an average improvement of 85% in memory consumption and 49% in computation time compared to the state of the art. For large data sets, our algorithm stands out as the only capable SPM approach within 256GB of system memory, potentially saving 1.7TB in memory consumption.
Sample Complexity of Neural Policy Mirror Descent for Policy Optimization on Low-Dimensional Manifolds
http://jmlr.org/papers/v25/24-0066.html
http://jmlr.org/papers/volume25/24-0066/24-0066.pdf
2024Zhenghao Xu, Xiang Ji, Minshuo Chen, Mengdi Wang, Tuo Zhao
Policy gradient methods equipped with deep neural networks have achieved great success in solving high-dimensional reinforcement learning (RL) problems. However, current analyses cannot explain why they are resistant to the curse of dimensionality.
In this work, we study the sample complexity of the neural policy mirror descent (NPMD) algorithm with deep convolutional neural networks (CNN). Motivated by the empirical observation that many high-dimensional environments have state spaces possessing low-dimensional structures, such as those taking images as states, we consider the state space to be a $d$-dimensional manifold embedded in the $D$-dimensional Euclidean space with intrinsic dimension $d\ll D$.
We show that in each iteration of NPMD, both the value function and the policy can be well approximated by CNNs. The approximation errors are controlled by the size of the networks, and the smoothness of the previous networks can be inherited.
As a result, by properly choosing the network size and hyperparameters, NPMD can find an $\epsilon$-optimal policy with $\tilde{O}(\epsilon^{-\frac{d}{\alpha}-2})$ samples in expectation, where $\alpha\in(0,1]$ indicates the smoothness of the environment.
Compared to previous work, our result shows that NPMD can leverage the low-dimensional structure of the state space to escape from the curse of dimensionality, explaining the efficacy of deep policy gradient algorithms.
Split Conformal Prediction and Non-Exchangeable Data
http://jmlr.org/papers/v25/23-1553.html
http://jmlr.org/papers/volume25/23-1553/23-1553.pdf
2024Roberto I. Oliveira, Paulo Orenstein, Thiago Ramos, João Vitor Romano
Split conformal prediction (CP) is arguably the most popular CP method for uncertainty quantification, enjoying both academic interest and widespread deployment. However, the original theoretical analysis of split CP makes the crucial assumption of data exchangeability, which hinders many real-world applications. In this paper, we present a novel theoretical framework based on concentration inequalities and decoupling properties of the data, proving that split CP remains valid for many non-exchangeable processes by adding a small coverage penalty. Through experiments with both real and synthetic data, we show that our theoretical results translate to good empirical performance under non-exchangeability, e.g., for time series and spatiotemporal data. Compared to recent conformal algorithms designed to counter specific exchangeability violations, we show that split CP is competitive in terms of coverage and interval size, with the benefit of being extremely simple and orders of magnitude faster than alternatives.
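As a point of reference for the abstract above, the following is a minimal sketch of the standard split conformal procedure for regression, assuming absolute-residual scores and exchangeable data. The function name and arguments are hypothetical, and the coverage penalty that the paper derives for non-exchangeable data is not modeled here.

```python
import math

def split_conformal_interval(cal_preds, cal_targets, new_pred, alpha=0.1):
    """Prediction interval from a held-out calibration set (illustrative sketch).

    cal_preds / cal_targets: model predictions and true values on calibration data.
    new_pred: the model's point prediction for a new input.
    alpha: target miscoverage level, so the nominal coverage is 1 - alpha.
    """
    # Conformity scores: absolute residuals on the calibration set.
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    # Rank of the conformal quantile with the usual (n + 1) finite-sample correction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for the requested level: the interval is trivial.
        return (-math.inf, math.inf)
    q = scores[k - 1]
    return (new_pred - q, new_pred + q)
```

The interval width depends only on the calibration residuals, which is why the method is cheap: the model is fit once and the calibration step is a single sort.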
Structured Dynamic Pricing: Optimal Regret in a Global Shrinkage Model
http://jmlr.org/papers/v25/23-1365.html
http://jmlr.org/papers/volume25/23-1365/23-1365.pdf
2024Rashmi Ranjan Bhuyan, Adel Javanmard, Sungchul Kim, Gourab Mukherjee, Ryan A. Rossi, Tong Yu, Handong Zhao
We consider dynamic pricing strategies in a streamed longitudinal data set-up where the objective is to maximize, over time, the cumulative profit across a large number of customer segments. We consider a dynamic model with the consumers’ preferences as well as price sensitivity varying over time. Building on the well-known finding that consumers sharing similar characteristics act in similar ways, we consider a global shrinkage structure, which assumes that the consumers’ preferences across the different segments can be well approximated by a spatial autoregressive (SAR) model. In such a streamed longitudinal setup, we measure the performance of a dynamic pricing policy via regret, which is the expected revenue loss compared to a clairvoyant that knows the sequence of model parameters in advance. We propose a pricing policy based on penalized stochastic gradient descent (PSGD) and explicitly characterize its regret as functions of time, the temporal variability in the model parameters as well as the strength of the auto-correlation network structure spanning the varied customer segments. Our regret analysis results not only demonstrate asymptotic optimality of the proposed policy but also show that for policy planning it is essential to incorporate available structural information as policies based on unshrunken models are highly sub-optimal in the aforementioned set-up. We conduct simulation experiments across a wide range of regimes as well as real-world network-based studies and report encouraging performance for our proposed method.
Sparse Graphical Linear Dynamical Systems
http://jmlr.org/papers/v25/23-0878.html
http://jmlr.org/papers/volume25/23-0878/23-0878.pdf
2024Emilie Chouzenoux, Victor Elvira
Time-series datasets are central in machine learning with applications in numerous fields of science and engineering, such as biomedicine, Earth observation, and network analysis. Extensive research exists on state-space models (SSMs), which are powerful mathematical tools that allow for probabilistic and interpretable learning on time series. Learning the model parameters in SSMs is arguably one of the most complicated tasks, and the inclusion of prior knowledge is known to ease the interpretation but also to complicate the inferential tasks. Very recent works have attempted to incorporate a graphical perspective on some of those model parameters, but they present notable limitations that this work addresses. More generally, existing graphical modeling tools are designed to incorporate either static information, focusing on statistical dependencies among independent random variables (e.g., graphical Lasso approach), or dynamic information, emphasizing causal relationships among time series samples (e.g., graphical Granger approaches). However, there are no joint approaches combining static and dynamic graphical modeling within the context of SSMs. This work proposes a novel approach to fill this gap by introducing a joint graphical modeling framework that bridges the graphical Lasso model and a causal-based graphical approach for the linear-Gaussian SSM. We present DGLASSO (Dynamic Graphical Lasso), a new inference method within this framework that implements an efficient block alternating majorization-minimization algorithm. The algorithm's convergence is established by departing from modern tools from nonlinear analysis. Experimental validation on various synthetic data showcases the effectiveness of the proposed model and inference algorithm.
This work will significantly contribute to the understanding and utilization of time-series data in diverse scientific and engineering applications where incorporating a graphical approach is essential to perform the inference.
Statistical analysis for a penalized EM algorithm in high-dimensional mixture linear regression model
http://jmlr.org/papers/v25/23-0296.html
http://jmlr.org/papers/volume25/23-0296/23-0296.pdf
2024Ning Wang, Xin Zhang, Qing Mai
The expectation-maximization (EM) algorithm and its variants are widely used in statistics. In high-dimensional mixture linear regression, the model is assumed to be a finite mixture of linear regressions and the number of predictors is much larger than the sample size. The standard EM algorithm, which attempts to find the maximum likelihood estimator, becomes infeasible for such a model. We devise a group lasso penalized EM algorithm and study its statistical properties. Existing theoretical results for regularized EM algorithms often rely on dividing the sample into many independent batches and employing a fresh batch of samples in each iteration of the algorithm. Our algorithm and theoretical analysis do not require sample-splitting, and can be extended to multivariate response cases. The proposed methods also show encouraging performance in numerical studies.
Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds
http://jmlr.org/papers/v25/22-1253.html
http://jmlr.org/papers/volume25/22-1253/22-1253.pdf
2024Hao Liang, Zhi-Quan Luo
We study the regret guarantee for risk-sensitive reinforcement learning (RSRL) via distributional reinforcement learning (DRL) methods. In particular, we consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of return. By leveraging a key property of the EntRM, the independence property, we establish the risk-sensitive distributional dynamic programming framework. We then propose two novel DRL algorithms that implement optimism through two different schemes, including a model-free one and a model-based one.
We prove that they both attain a $\tilde{\mathcal{O}}\left(\frac{\exp(|\beta| H)-1}{|\beta|}H\sqrt{S^2AK}\right)$ regret upper bound, where $S$, $A$, $K$, $H$, $T=KH$, and $\beta$ represent the number of states, actions, episodes, time horizon, number of total time-steps, and risk parameter respectively. It matches RSVI2, with novel distributional analysis that focuses on the distributions of returns rather than the risk values associated with these returns. To the best of our knowledge, this is the first regret analysis that bridges DRL and RSRL in terms of sample complexity.
To address the computational inefficiencies inherent in the model-free DRL algorithm, we propose an alternative DRL algorithm with distribution representation. This approach effectively represents any bounded distribution using a refined distribution class. It significantly improves computational efficiency while maintaining the established regret bounds.
We also prove a tighter minimax lower bound of $\Omega\left(\frac{\exp(\beta H/6)-1}{\beta }\sqrt{SAT}\right)$ for the $\beta>0$ case, which recovers the tight lower bound $\Omega(H\sqrt{SAT})$ in the risk-neutral setting.
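For readers unfamiliar with the objective used above, here is a small sketch of the entropic risk measure for a discrete return distribution, assuming the standard definition $\frac{1}{\beta}\log \mathbb{E}[\exp(\beta X)]$ (the abstract does not spell the formula out). The function name is hypothetical and the code is illustrative only, not the authors' algorithm.

```python
import math

def entropic_risk(values, probs, beta):
    """Entropic risk (1/beta) * log E[exp(beta * X)] of a discrete distribution.

    values / probs: support points and their probabilities (probs sum to 1).
    beta: nonzero risk parameter; beta > 0 is risk-seeking, beta < 0 risk-averse.
    """
    if beta == 0:
        raise ValueError("beta must be nonzero; the beta -> 0 limit is the mean")
    # Log-sum-exp with the maximum exponent factored out for numerical stability.
    exponents = [beta * v for v in values]
    m = max(exponents)
    total = sum(p * math.exp(e - m) for p, e in zip(probs, exponents))
    return (m + math.log(total)) / beta
```

For a small $|\beta|$ the value approaches the ordinary expectation, which mirrors how the bounds above reduce to the risk-neutral case as $\beta \to 0$.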
Low-Rank Matrix Estimation in the Presence of Change-Points
http://jmlr.org/papers/v25/22-0852.html
http://jmlr.org/papers/volume25/22-0852/22-0852.pdf
2024Lei Shi, Guanghui Wang, Changliang Zou
We consider a general trace regression model with multiple structural changes and propose a universal approach for simultaneous exact or near-low-rank matrix recovery and change-point detection. It incorporates nuclear norm penalized least-squares minimization into a grid search scheme that determines the potential structural break. Under a set of general conditions, we establish the non-asymptotic error bounds with a nearly-oracle rate for the matrix estimators as well as the super-consistency rate for the change-point localization. We use concrete random design instances to justify the appropriateness of the proposed conditions. Numerical results demonstrate the validity and effectiveness of the proposed scheme.
A Framework for Improving the Reliability of Black-box Variational Inference
http://jmlr.org/papers/v25/22-0327.html
http://jmlr.org/papers/volume25/22-0327/22-0327.pdf
2024Manushi Welandawe, Michael Riis Andersen, Aki Vehtari, Jonathan H. Huggins
Black-box variational inference (BBVI) now sees widespread use in machine learning and statistics as a fast yet flexible alternative to Markov chain Monte Carlo methods for approximate Bayesian inference. However, stochastic optimization methods for BBVI remain unreliable and require substantial expertise and hand-tuning to apply effectively. In this paper, we propose robust and automated black-box VI (RABVI), a framework for improving the reliability of BBVI optimization. RABVI is based on rigorously justified automation techniques, includes just a small number of intuitive tuning parameters, and detects inaccurate estimates of the optimal variational approximation. RABVI adaptively decreases the learning rate by detecting convergence of the fixed--learning-rate iterates, then estimates the symmetrized Kullback--Leibler (KL) divergence between the current variational approximation and the optimal one. It also employs a novel optimization termination criterion that enables the user to balance desired accuracy against computational cost by comparing (i) the predicted relative decrease in the symmetrized KL divergence if a smaller learning rate were used and (ii) the predicted computation required to converge with the smaller learning rate. We validate the robustness and accuracy of RABVI through carefully designed simulation studies and on a diverse set of real-world model and data examples.
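To make the symmetrized KL divergence mentioned above concrete, the sketch below evaluates it in closed form for two univariate Gaussians. This assumes the convention KL(p||q) + KL(q||p) with no factor of one half, which may differ from the paper's normalization, and it is not RABVI's estimator; the function names are hypothetical.

```python
import math

def kl_gaussian(m1, s1, m2, s2):
    """KL( N(m1, s1^2) || N(m2, s2^2) ) for univariate Gaussians with s1, s2 > 0."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

def symmetrized_kl_gaussian(m1, s1, m2, s2):
    """Sum of the two directed KL divergences (assumed convention, see lead-in)."""
    return kl_gaussian(m1, s1, m2, s2) + kl_gaussian(m2, s2, m1, s1)
```

Unlike either directed divergence alone, the symmetrized quantity treats the two distributions interchangeably, so swapping the arguments leaves the value unchanged.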
Understanding Entropic Regularization in GANs
http://jmlr.org/papers/v25/21-1295.html
http://jmlr.org/papers/volume25/21-1295/21-1295.pdf
2024Daria Reshetova, Yikun Bai, Xiugang Wu, Ayfer Özgür
Generative Adversarial Networks (GANs) are a popular method for learning distributions from data by modeling the target distribution as a function of a known distribution. The function, often referred to as the generator, is optimized to minimize a chosen distance measure between the generated and target distributions. One commonly used measure for this purpose is the Wasserstein distance. However, Wasserstein distance is hard to compute and optimize, and in practice entropic regularization techniques are used to facilitate its computation and improve numerical convergence. The influence of regularization on the learned solution, however, is still not well understood. In this paper, we study how several popular entropic regularizations of Wasserstein distance impact the solution learned by a Wasserstein GAN in a simple benchmark setting where the generator is linear and the target distribution is high-dimensional Gaussian. We show that entropy regularization of Wasserstein distance promotes sparsification of the solution, while replacing the Wasserstein distance with the Sinkhorn divergence recovers the unregularized solution. The significant benefit of both regularization techniques is that they remove the curse of dimensionality suffered by Wasserstein distance. We show that in both cases the optimal generator can be learned to accuracy $\epsilon$ with $O(1/\epsilon^2)$ samples from the target distribution without the need to constrain the discriminator. We thus conclude that these regularization techniques can improve the quality of the generator learned from empirical data in a way that is applicable for a large class of distributions.
BenchMARL: Benchmarking Multi-Agent Reinforcement Learning
http://jmlr.org/papers/v25/23-1612.html
http://jmlr.org/papers/volume25/23-1612/23-1612.pdf
2024Matteo Bettini, Amanda Prorok, Vincent Moens
The field of Multi-Agent Reinforcement Learning (MARL) is currently facing a reproducibility crisis. While solutions for standardized reporting have been proposed to address the issue, we still lack a benchmarking tool that enables standardization and reproducibility, while leveraging cutting-edge Reinforcement Learning (RL) implementations. In this paper, we introduce BenchMARL, the first MARL training library created to enable standardized benchmarking across different algorithms, models, and environments. BenchMARL uses TorchRL as its backend, granting it high-performance and maintained state-of-the-art implementations while addressing the broad community of MARL PyTorch users. Its design enables systematic configuration and reporting, thus allowing users to create and run complex benchmarks from simple one-line inputs. BenchMARL is open-sourced on GitHub at https://github.com/facebookresearch/BenchMARL
Learning from many trajectories
http://jmlr.org/papers/v25/23-1145.html
http://jmlr.org/papers/volume25/23-1145/23-1145.pdf
2024Stephen Tu, Roy Frostig, Mahdi Soltanolkotabi
We initiate a study of supervised learning from many independent sequences ("trajectories") of non-independent covariates, reflecting tasks in sequence modeling, control, and reinforcement learning. Conceptually, our multi-trajectory setup sits between two traditional settings in statistical learning theory: learning from independent examples and learning from a single auto-correlated sequence. Our conditions for efficient learning generalize the former setting---trajectories must be non-degenerate in ways that extend standard requirements for independent examples. Notably, we do not require that trajectories be ergodic, long, nor strictly stable. For linear least-squares regression, given $n$-dimensional examples produced by $m$ trajectories, each of length $T$, we observe a notable change in statistical efficiency as the number of trajectories increases from a few (namely $m \lesssim n$) to many (namely $m \gtrsim n$). Specifically, we establish that the worst-case error rate of this problem is $\Theta(n / m T)$ whenever $m \gtrsim n$. Meanwhile, when $m \lesssim n$, we establish a (sharp) lower bound of $\Omega(n^2 / m^2 T)$ on the worst-case error rate, realized by a simple, marginally unstable linear dynamical system. A key upshot is that, in domains where trajectories regularly reset, the error rate eventually behaves as if all of the examples were independent, drawn from their marginals. As a corollary of our analysis, we also improve guarantees for the linear system identification problem.
Interpretable algorithmic fairness in structured and unstructured data
http://jmlr.org/papers/v25/23-0816.html
http://jmlr.org/papers/volume25/23-0816/23-0816.pdf
2024Hari Bandi, Dimitris Bertsimas, Thodoris Koukouvinos, Sofie Kupiec
Systemic bias with respect to gender and race is prevalent in datasets, making it challenging to train classification models that are accurate and alleviate bias. We propose a unified method for alleviating bias in structured and unstructured data, based on a novel optimization approach for optimally flipping outcome labels and training classification models simultaneously. In the case of structured data, we introduce constraints on selected objective measures of meritocracy, and present four case studies, demonstrating that our approach often outperforms state-of-the-art methods in terms of fairness and meritocracy. In the case of unstructured data, we present two case studies on image classification, demonstrating that our method outperforms state-of-the-art approaches in terms of fairness. Moreover, we note that the decrease in accuracy over the nominal model is $3.31 \%$ on structured data and $0.65 \%$ on unstructured data. Finally, we leverage Optimal Classification Trees (OCTs) to provide insights on which attributes of individuals lead to flipping of their labels and apply it to interpret the flipping decisions on structured data. Utilizing OCTs with auxiliary tabular data as well as Gradient-weighted Class Activation Mapping (Grad-CAM), we provide insights on the flipping decisions for unstructured data.
FedCBO: Reaching Group Consensus in Clustered Federated Learning through Consensus-based Optimization
http://jmlr.org/papers/v25/23-0764.html
http://jmlr.org/papers/volume25/23-0764/23-0764.pdf
2024José A. Carrillo, Nicolás García Trillos, Sixu Li, Yuhua Zhu
Federated learning is an important framework in modern machine learning that seeks to integrate the training of learning models from multiple users, each user having their own local data set, in a way that is sensitive to data privacy and to communication loss constraints. In clustered federated learning, one assumes an additional unknown group structure among users, and the goal is to train models that are useful for each group, rather than simply training a single global model for all users. In this paper, we propose a novel solution to the problem of clustered federated learning that is inspired by ideas in consensus-based optimization (CBO). Our new CBO-type method is based on a system of interacting particles that is oblivious to group memberships. Our model is motivated by rigorous mathematical reasoning, which includes a mean-field analysis describing the large number of particles limit of our particle system, as well as convergence guarantees for the simultaneous global optimization of general non-convex objective functions (corresponding to the loss functions of each cluster of users) in the mean-field regime. Experimental results demonstrate the efficacy of our FedCBO algorithm compared to other state-of-the-art methods and help validate our methodological and theoretical work.
On the Connection between Lp- and Risk Consistency and its Implications on Regularized Kernel Methods
http://jmlr.org/papers/v25/23-0397.html
http://jmlr.org/papers/volume25/23-0397/23-0397.pdf
2024Hannes Köhler
As a predictor's quality is often assessed by means of its risk, it is natural to regard risk consistency as a desirable property of learning methods, and many such methods have indeed been shown to be risk consistent. The first aim of this paper is to establish the close connection between risk consistency and $L_p$-consistency for a considerably wider class of loss functions than has been done before. The attempt to transfer this connection to shifted loss functions surprisingly reveals that this shift does not reduce the assumptions needed on the underlying probability measure to the same extent as it does for many other results. The results are applied to regularized kernel methods such as support vector machines.
Pre-trained Gaussian Processes for Bayesian Optimization
http://jmlr.org/papers/v25/23-0269.html
http://jmlr.org/papers/volume25/23-0269/23-0269.pdf
2024Zi Wang, George E. Dahl, Kevin Swersky, Chansoo Lee, Zachary Nado, Justin Gilmer, Jasper Snoek, Zoubin Ghahramani
Bayesian optimization (BO) has become a popular strategy for global optimization of expensive real-world functions. Contrary to a common expectation that BO is suited to optimizing black-box functions, it actually requires domain knowledge about those functions to deploy BO successfully. Such domain knowledge often manifests in Gaussian process (GP) priors that specify initial beliefs on functions. However, even with expert knowledge, it is non-trivial to quantitatively define a prior. This is especially true for hyperparameter tuning problems on complex machine learning models, where landscapes of tuning objectives are often difficult to comprehend. We seek an alternative practice for setting these functional priors. In particular, we consider the scenario where we have data from similar functions that allow us to pre-train a tighter distribution a priori.
We detail what pre-training entails for GPs using a KL divergence based loss function, and propose a new pre-training based BO framework named HyperBO. Theoretically, we show bounded posterior predictions and near-zero regrets for HyperBO without assuming the "ground truth" GP prior is known. To verify our approach in realistic setups, we collect a large multi-task hyperparameter tuning dataset by training tens of thousands of configurations of near-state-of-the-art deep learning models on popular image and text datasets, as well as a protein sequence dataset. Our results show that on average, HyperBO is able to locate good hyperparameters at least 3 times more efficiently than the best competing methods on both our new tuning dataset and existing multi-task BO benchmarks.
Heterogeneity-aware Clustered Distributed Learning for Multi-source Data Analysis
http://jmlr.org/papers/v25/23-0059.html
http://jmlr.org/papers/volume25/23-0059/23-0059.pdf
2024Yuanxing Chen, Qingzhao Zhang, Shuangge Ma, Kuangnan Fang
In diverse fields ranging from finance to omics, it is increasingly common that data is distributed with multiple individual sources (referred to as “clients” in some studies). Integrating raw data, although powerful, is often not feasible, for example, when there are considerations on privacy protection. Distributed learning techniques have been developed to integrate summary statistics as opposed to raw data. In many existing distributed learning studies, it is stringently assumed that all the clients have the same model. To accommodate data heterogeneity, some federated learning methods allow for client-specific models. In this article, we consider the scenario that clients form clusters, those in the same cluster have the same model, and different clusters have different models. Further considering the clustering structure can lead to a better understanding of the “interconnections” among clients and reduce the number of parameters. To this end, we develop a novel penalization approach. Specifically, group penalization is imposed for regularized estimation and selection of important variables, and fusion penalization is imposed to automatically cluster clients. An effective ADMM algorithm is developed, and the estimation, selection, and clustering consistency properties are established under mild conditions. Simulation and data analysis further demonstrate the practical utility and superiority of the proposed approach.
From Small Scales to Large Scales: Distance-to-Measure Density based Geometric Analysis of Complex Data
http://jmlr.org/papers/v25/22-1344.html
http://jmlr.org/papers/volume25/22-1344/22-1344.pdf
2024Katharina Proksch, Christoph Alexander Weikamp, Thomas Staudt, Benoit Lelandais, Christophe Zimmer
How can we tell complex point clouds with different small scale characteristics apart, while disregarding global features? Can we find a suitable transformation of such data in a way that allows us to discriminate between differences in this sense with statistical guarantees? In this paper, we consider the analysis and classification of complex point clouds as they are obtained, e.g., via single molecule localization microscopy. We focus on the task of identifying differences between noisy point clouds based on small scale characteristics, while disregarding large scale information such as overall size. We propose an approach based on a transformation of the data via the so-called Distance-to-Measure (DTM) function, a transformation which is based on the average of nearest neighbor distances. For each data set, we estimate the probability density of average local distances of all data points and use the estimated densities for classification. While the applicability is immediate and the practical performance of the proposed methodology is very good, the theoretical study of the density estimators is quite challenging, as they are based on non-i.i.d. observations that have been obtained via a complicated transformation. In fact, the transformed data are stochastically dependent in a non-local way that is not captured by commonly considered dependence measures. Nonetheless, we show that the asymptotic behaviour of the density estimator is driven by a kernel density estimator of certain i.i.d. random variables by using theoretical properties of $U$-statistics, which allows us to handle the dependencies via a Hoeffding decomposition. We show via a numerical study and in an application to simulated single molecule localization microscopy data of chromatin fibers that unsupervised classification tasks based on estimated DTM-densities achieve excellent separation results.
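To make the idea of a transformation "based on the average of nearest neighbor distances" concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The function name `dtm_values`, the choice to exclude each point from its own neighbor list, and the use of plain (unsquared) Euclidean distances are all illustrative assumptions; the paper's exact DTM definition may differ.

```python
import math


def dtm_values(points, k):
    """Return, for each point, the mean Euclidean distance to its k nearest
    other points (a simplified, assumed stand-in for the DTM transform)."""
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    values = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        values.append(sum(dists[:k]) / k)
    return values


# Corners of the unit square: each corner has two neighbors at distance 1.
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
local_scales = dtm_values(square, k=2)
```

The resulting list of local distances is the kind of one-dimensional sample whose density would then be estimated and compared across data sets.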
PAMI: An Open-Source Python Library for Pattern Mining
http://jmlr.org/papers/v25/22-1026.html
http://jmlr.org/papers/volume25/22-1026/22-1026.pdf
2024Uday Kiran Rage, Veena Pamalla, Masashi Toyoda, Masaru Kitsuregawa
Crucial information that can empower users with competitive information to achieve socio-economic development lies hidden in big data. Pattern mining aims to discover this needed information by finding user interest-based patterns in big data. Unfortunately, existing pattern mining libraries are limited to finding a few types of patterns in transactional and sequence databases. This paper tackles this problem by providing a cross-platform open-source Python library called PAttern MIning (PAMI). PAMI provides several algorithms to discover different types of patterns hidden in various types of databases across multiple computing architectures. PAMI also contains algorithms to generate various types of synthetic databases. PAMI offers a command line interface, Jupyter Notebook support, and easy maintenance through the Python Package Index. Furthermore, the source code is available under the GNU General Public License, version 3. Finally, PAMI offers several resources, such as a user's guide, a developer's guide, datasets, and a bug report.
Law of Large Numbers and Central Limit Theorem for Wide Two-layer Neural Networks: The Mini-Batch and Noisy Case
http://jmlr.org/papers/v25/22-0952.html
http://jmlr.org/papers/volume25/22-0952/22-0952.pdf
2024Arnaud Descours, Arnaud Guillin, Manon Michel, Boris Nectoux
In this work, we consider a wide two-layer neural network and study the behavior of its empirical weights under a dynamics set by a stochastic gradient descent along the quadratic loss with mini-batches and noise. Our goal is to prove a trajectorial law of large numbers as well as a central limit theorem for their evolution. When the noise scales as $1/N^\beta$ and $1/2<\beta\le\infty$, we rigorously derive and generalize the LLN obtained for example by Rotskoff and Vanden-Eijnden (Comm. Pure Appl. Math., 2022), Mei, Montanari and Nguyen (PNAS, 2018) or Sirignano and Spiliopoulos (SIAM J. Appl. Math., 2020). When $3/4<\beta\le\infty$, we also generalize the CLT of Sirignano and Spiliopoulos (Stoch. Proc. Appl., 2020) and further exhibit the effect of mini-batching on the asymptotic variance which drives the fluctuations. The case $\beta=3/4$ is trickier and we give an example showing the divergence with time of the variance, thus establishing the instability of the predictions of the neural network in this case. These results are illustrated by simple numerical examples.
Risk Measures and Upper Probabilities: Coherence and Stratification
http://jmlr.org/papers/v25/22-0641.html
http://jmlr.org/papers/volume25/22-0641/22-0641.pdf
2024Christian Fröhlich, Robert C. Williamson
Machine learning typically presupposes classical probability theory which implies that aggregation is built upon expectation. There are now multiple reasons to motivate looking at richer alternatives to classical probability theory as a mathematical foundation for machine learning. We systematically examine a powerful and rich class of alternative aggregation functionals, known variously as spectral risk measures, Choquet integrals or Lorentz norms. We present a range of characterization results, and demonstrate what makes this spectral family so special. In doing so we arrive at a natural stratification of all coherent risk measures in terms of the upper probabilities that they induce by exploiting results from the theory of rearrangement invariant Banach spaces. We empirically demonstrate how this new approach to uncertainty helps tackle practical machine learning problems.
Parallel-in-Time Probabilistic Numerical ODE Solvers
http://jmlr.org/papers/v25/23-1261.html
http://jmlr.org/papers/volume25/23-1261/23-1261.pdf
2024Nathanael Bosch, Adrien Corenflos, Fatemeh Yaghoobi, Filip Tronarp, Philipp Hennig, Simo Särkkä
Probabilistic numerical solvers for ordinary differential equations (ODEs) treat the numerical simulation of dynamical systems as problems of Bayesian state estimation. Aside from producing posterior distributions over ODE solutions and thereby quantifying the numerical approximation error of the method itself, one less-often noted advantage of this formalism is the algorithmic flexibility gained by formulating numerical simulation in the framework of Bayesian filtering and smoothing. In this paper, we leverage this flexibility and build on the time-parallel formulation of iterated extended Kalman smoothers to formulate a parallel-in-time probabilistic numerical ODE solver. Instead of simulating the dynamical system sequentially in time, as done by current probabilistic solvers, the proposed method processes all time steps in parallel and thereby reduces the computational complexity from linear to logarithmic in the number of time steps. We demonstrate the effectiveness of our approach on a variety of ODEs and compare it to a range of both classic and probabilistic numerical ODE solvers.
Scalable High-Dimensional Multivariate Linear Regression for Feature-Distributed Data
http://jmlr.org/papers/v25/23-0882.html
http://jmlr.org/papers/volume25/23-0882/23-0882.pdf
2024Shuo-Chieh Huang, Ruey S. Tsay
Feature-distributed data, referring to data partitioned by features and stored across multiple computing nodes, are increasingly common in applications with a large number of features. This paper proposes a two-stage relaxed greedy algorithm (TSRGA) for applying multivariate linear regression to such data. The main advantage of TSRGA is that its communication complexity does not depend on the feature dimension, making it highly scalable to very large data sets. In addition, for multivariate response variables, TSRGA can be used to yield low-rank coefficient estimates. The fast convergence of TSRGA is validated by simulation experiments. Finally, we apply the proposed TSRGA in a financial application that leverages unstructured data from the 10-K reports, demonstrating its usefulness in applications with many dense large-dimensional matrices.
Dropout Regularization Versus l2-Penalization in the Linear Model
http://jmlr.org/papers/v25/23-0803.html
http://jmlr.org/papers/volume25/23-0803/23-0803.pdf
2024Gabriel Clara, Sophie Langer, Johannes Schmidt-Hieber
We investigate the statistical behavior of gradient descent iterates with dropout in the linear regression model. In particular, non-asymptotic bounds for the convergence of expectations and covariance matrices of the iterates are derived. The results shed more light on the widely cited connection between dropout and $\ell_2$-regularization in the linear model. We indicate a more subtle relationship, owing to interactions between the gradient descent dynamics and the additional randomness induced by dropout. Further, we study a simplified variant of dropout which does not have a regularizing effect and converges to the least squares estimator.
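As a rough illustration of gradient descent with dropout in a linear model, the sketch below draws a fresh Bernoulli mask at every step and updates only the retained coordinates. It is a sketch under assumptions rather than the scheme analysed in the paper: the update form $\beta \leftarrow \beta + \alpha D X^\top (y - X D \beta)$ with a random diagonal 0/1 matrix $D$, and the names `dropout_gd`, `keep_prob` and `step`, are chosen here for illustration only.

```python
import random


def dropout_gd(X, y, keep_prob, step, n_iter, seed=0):
    """Gradient descent on squared loss where each coordinate of beta is
    kept with probability keep_prob at every iteration (assumed update form)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    for _ in range(n_iter):
        # Fresh dropout mask: 1.0 keeps a coordinate, 0.0 drops it.
        mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in range(d)]
        # Residuals of the thinned model, y - X D beta.
        resid = [
            y[i] - sum(X[i][j] * mask[j] * beta[j] for j in range(d))
            for i in range(n)
        ]
        # Update only the retained coordinates: beta += step * D X^T resid.
        for j in range(d):
            grad_j = sum(X[i][j] * resid[i] for i in range(n))
            beta[j] += step * mask[j] * grad_j
    return beta


# Noiseless toy data with exact least squares solution (1, 2).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
beta_plain = dropout_gd(X, y, keep_prob=1.0, step=0.1, n_iter=500)
beta_dropout = dropout_gd(X, y, keep_prob=0.8, step=0.1, n_iter=500)
```

With `keep_prob=1.0` the mask is always the identity and the loop reduces to ordinary gradient descent, which gives a baseline against which the randomly masked iterates can be compared.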
Efficient Convex Algorithms for Universal Kernel Learning
http://jmlr.org/papers/v25/23-0528.html
http://jmlr.org/papers/volume25/23-0528/23-0528.pdf
2024Aleksandr Talitckii, Brendon Colbert, Matthew M. Peet
The accuracy and complexity of machine learning algorithms based on kernel optimization are determined by the set of kernels over which they are able to optimize. An ideal set of kernels should: admit a linear parameterization (for tractability); be dense in the set of all kernels (for robustness); be universal (for accuracy). Recently, a framework was proposed for using positive matrices to parameterize a class of positive semi-separable kernels. Although this class can be shown to meet all three criteria, previous algorithms for optimization of such kernels were limited to classification and furthermore relied on computationally complex Semidefinite Programming (SDP) algorithms. In this paper, we pose the problem of learning semiseparable kernels as a minimax optimization problem and propose an SVD-QCQP primal-dual algorithm which dramatically reduces the computational complexity as compared with previous SDP-based approaches. Furthermore, we provide an efficient implementation of this algorithm for both classification and regression, an implementation which enables us to solve problems with 100 features and up to 30,000 data points. Finally, when applied to benchmark data, the algorithm demonstrates the potential for significant improvement in accuracy over typical (but non-convex) approaches such as Neural Nets and Random Forest with similar or better computation time.
Manifold Learning by Mixture Models of VAEs for Inverse Problems
http://jmlr.org/papers/v25/23-0396.html
http://jmlr.org/papers/volume25/23-0396/23-0396.pdf
2024Giovanni S. Alberti, Johannes Hertrich, Matteo Santacesaria, Silvia Sciutto
Representing a manifold of very high-dimensional data with generative models has been shown to be computationally efficient in practice. However, this requires that the data manifold admits a global parameterization. In order to represent manifolds of arbitrary topology, we propose to learn a mixture model of variational autoencoders. Here, every encoder-decoder pair represents one chart of a manifold. We propose a loss function for maximum likelihood estimation of the model weights and choose an architecture that provides us the analytical expression of the charts and of their inverses. Once the manifold is learned, we use it for solving inverse problems by minimizing a data fidelity term restricted to the learned manifold. To solve the arising minimization problem we propose a Riemannian gradient descent algorithm on the learned manifold. We demonstrate the performance of our method for low-dimensional toy examples as well as for deblurring and electrical impedance tomography on certain image manifolds.
An Algorithmic Framework for the Optimization of Deep Neural Networks Architectures and Hyperparameters
http://jmlr.org/papers/v25/23-0166.html
http://jmlr.org/papers/volume25/23-0166/23-0166.pdf
2024Julie Keisler, El-Ghazali Talbi, Sandra Claudel, Gilles Cabriel
In this paper, we propose DRAGON (for DiRected Acyclic Graph OptimizatioN), an algorithmic framework to automatically generate efficient deep neural network architectures and optimize their associated hyperparameters. The framework is based on evolving Directed Acyclic Graphs (DAGs), defining a more flexible search space than the existing ones in the literature. It allows mixtures of different classical operations: convolutions, recurrences and dense layers, but also more recent operations such as self-attention. Based on this search space we propose neighbourhood and evolution search operators to optimize both the architecture and hyperparameters of our networks. These search operators can be used with any metaheuristic capable of handling mixed search spaces. We tested our algorithmic framework with an asynchronous evolutionary algorithm on a time series forecasting benchmark. The results demonstrate that DRAGON outperforms state-of-the-art handcrafted models and AutoML techniques for time series forecasting on numerous datasets. DRAGON has been implemented as an open-source Python package.
Distributionally Robust Model-Based Offline Reinforcement Learning with Near-Optimal Sample Complexity
http://jmlr.org/papers/v25/22-1482.html
http://jmlr.org/papers/volume25/22-1482/22-1482.pdf
2024Laixi Shi, Yuejie Chi
This paper concerns the central issues of model robustness and sample efficiency in offline reinforcement learning (RL), which aims to learn to perform decision making from history data without active exploration. Due to uncertainties and variabilities of the environment, it is critical to learn a robust policy, with as few samples as possible, that performs well even when the deployed environment deviates from the nominal one used to collect the history dataset. We consider a distributionally robust formulation of offline RL, focusing on tabular robust Markov decision processes (RMDPs) with an uncertainty set specified by the Kullback-Leibler divergence in both finite-horizon and infinite-horizon settings. To combat sample scarcity, a model-based algorithm that combines distributionally robust value iteration with the principle of pessimism in the face of uncertainty is proposed, by penalizing the robust value estimates with a carefully designed data-driven penalty term. Under a mild and tailored assumption of the history dataset that measures distribution shift without requiring full coverage of the state-action space, we establish the finite-sample complexity of the proposed algorithms. We further develop an information-theoretic lower bound, which suggests that learning RMDPs is at least as hard as the standard MDPs when the uncertainty level is sufficiently small, and corroborates the tightness of our upper bound up to polynomial factors of the (effective) horizon length for a range of uncertainty levels. To the best of our knowledge, this provides the first provably near-optimal robust offline RL algorithm that learns under model uncertainty and partial coverage.
Grokking phase transitions in learning local rules with gradient descent
http://jmlr.org/papers/v25/22-1228.html
http://jmlr.org/papers/volume25/22-1228/22-1228.pdf
2024Bojan Žunkovič, Enej Ilievski
We discuss two solvable grokking (generalisation beyond overfitting) models in a rule-learning scenario. We show that grokking is a phase transition and find exact analytic expressions for the critical exponents, grokking probability, and grokking time distribution. Further, we introduce a tensor network map that connects the proposed grokking setup with the standard (perceptron) statistical learning theory and provide evidence that grokking is a consequence of the locality of the teacher model. We analyze the rule-30 cellular automaton learning task, numerically determine the critical exponent and the grokking time distribution, and compare them with the prediction of the proposed grokking model. Finally, we numerically study the connection between structure formation and grokking.
Unsupervised Tree Boosting for Learning Probability Distributions
http://jmlr.org/papers/v25/22-0980.html
http://jmlr.org/papers/volume25/22-0980/22-0980.pdf
2024Naoki Awaya, Li Ma
We propose an unsupervised tree boosting algorithm for inferring the underlying sampling distribution of an i.i.d. sample based on fitting additive tree ensembles in a manner analogous to supervised tree boosting. Integral to the algorithm is a new notion of "addition" on probability distributions that leads to a coherent notion of "residualization", i.e., subtracting a probability distribution from an observation to remove the distributional structure from the sampling distribution of the latter. We show that these notions arise naturally for univariate distributions through cumulative distribution function (CDF) transforms and compositions due to several "group-like" properties of univariate CDFs. While the traditional multivariate CDF does not preserve these properties, a new definition of multivariate CDF can restore these properties, thereby allowing the notions of "addition" and "residualization" to be formulated for multivariate settings as well. This then gives rise to the unsupervised boosting algorithm based on forward-stagewise fitting of an additive tree ensemble, which sequentially reduces the Kullback-Leibler divergence from the truth. The algorithm allows analytic evaluation of the fitted density and outputs a generative model that can be readily sampled from. We enhance the algorithm with scale-dependent shrinkage and a two-stage strategy that separately fits the marginals and the copula. The algorithm then performs competitively with state-of-the-art deep-learning approaches in multivariate density estimation on multiple benchmark data sets.
Linear Regression With Unmatched Data: A Deconvolution Perspective
http://jmlr.org/papers/v25/22-0930.html
http://jmlr.org/papers/volume25/22-0930/22-0930.pdf
2024Mona Azadkia, Fadoua Balabdaoui
Consider the regression problem where the response $Y\in\mathbb{R}$ and the covariate $X\in\mathbb{R}^d$ for $d\geq 1$ are unmatched. Under this scenario, we do not have access to pairs of observations from the distribution of $(X, Y)$, but instead, we have separate data sets $\{Y_i\}_{i=1}^{n_Y}$ and $\{X_j\}_{j=1}^{n_X}$, possibly collected from different sources. We study this problem assuming that the regression function is linear and the noise distribution is known, an assumption that we relax in the applications we consider. We introduce an estimator of the regression vector based on deconvolution and demonstrate its consistency and asymptotic normality under identifiability. Even when identifiability does not hold, we show in some cases that our estimator, the DLSE (Deconvolution Least Squared Estimator), is consistent in terms of an extended $\ell_2$ norm. Using this observation, we devise a method for semi-supervised learning, i.e., when we have access to a small sample of matched pairs $\{(X_k, Y_k)\}_{k=1}^m$. Several applications with synthetic and real data sets are considered to illustrate the theory.
Training Integrable Parameterizations of Deep Neural Networks in the Infinite-Width Limit
http://jmlr.org/papers/v25/21-1260.html
http://jmlr.org/papers/volume25/21-1260/21-1260.pdf
2024Karl Hajjar, Lénaïc Chizat, Christophe Giraud
To theoretically understand the behavior of trained deep neural networks, it is necessary to study the dynamics induced by gradient methods from a random initialization. However, the nonlinear and compositional structure of these models makes these dynamics difficult to analyze. To overcome these challenges, large-width asymptotics have recently emerged as a fruitful viewpoint and led to practical insights on real-world deep networks. For two-layer neural networks, it has been understood via these asymptotics that the nature of the trained model radically changes depending on the scale of the initial random weights, ranging from a kernel regime (for large initial variance) to a feature learning regime (for small initial variance). For deeper networks more regimes are possible, and in this paper we study in detail a specific choice of "small" initialization corresponding to "mean-field" limits of neural networks, which we call integrable parameterizations (IPs). First, we show that under standard i.i.d. zero-mean initialization, integrable parameterizations of neural networks with more than four layers start at a stationary point in the infinite-width limit and no learning occurs. We then propose various methods to avoid this trivial behavior and analyze in detail the resulting dynamics. In particular, one of these methods consists in using large initial learning rates, and we show that it is equivalent to a modification of the recently proposed maximal update parameterization µP. We confirm our results with numerical experiments on image classification tasks, which additionally show a strong difference in behavior between various choices of activation functions that is not yet captured by theory.
Sharp analysis of power iteration for tensor PCA
http://jmlr.org/papers/v25/24-0006.html
http://jmlr.org/papers/volume25/24-0006/24-0006.pdf
2024Yuchen Wu, Kangjie Zhou
We investigate the power iteration algorithm for the tensor PCA model introduced in Richard and Montanari (2014). Previous work studying the properties of tensor power iteration is either limited to a constant number of iterations, or requires a non-trivial data-independent initialization. In this paper, we move beyond these limitations and analyze the dynamics of randomly initialized tensor power iteration up to polynomially many steps. Our contributions are threefold: First, we establish sharp bounds on the number of iterations required for the power method to converge to the planted signal, for a broad range of signal-to-noise ratios. Second, our analysis reveals that the actual algorithmic threshold for power iteration is smaller than the one conjectured in the literature by a $\mathrm{polylog}(n)$ factor, where $n$ is the ambient dimension. Finally, we propose a simple and effective stopping criterion for power iteration, which provably outputs a solution that is highly correlated with the true signal. Extensive numerical experiments verify our theoretical results.
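For readers unfamiliar with tensor power iteration, the sketch below shows the basic update on an order-3 tensor: contract the tensor with the current iterate twice, then normalize. It is only an illustration under assumptions. The example uses a noiseless rank-one tensor $\lambda\, v \otimes v \otimes v$ (the tensor PCA model adds Gaussian noise, omitted here), a fixed rather than random initialization, and hypothetical helper names `rank_one_tensor`, `power_step` and `tensor_power_iteration`.

```python
import math


def rank_one_tensor(v, lam):
    """Build the order-3 tensor lam * v (x) v (x) v as nested lists."""
    n = len(v)
    return [
        [[lam * v[i] * v[j] * v[k] for k in range(n)] for j in range(n)]
        for i in range(n)
    ]


def power_step(T, u):
    """One power iteration step: u <- T(., u, u) / ||T(., u, u)||."""
    n = len(u)
    w = [
        sum(T[i][j][k] * u[j] * u[k] for j in range(n) for k in range(n))
        for i in range(n)
    ]
    norm = math.sqrt(sum(x * x for x in w))
    if norm == 0.0:
        raise ValueError("iterate has no component along the tensor signal")
    return [x / norm for x in w]


def tensor_power_iteration(T, u0, n_steps):
    """Run n_steps power iteration steps starting from u0."""
    u = list(u0)
    for _ in range(n_steps):
        u = power_step(T, u)
    return u


v = [0.6, 0.8, 0.0]          # unit-norm planted signal (toy example)
T = rank_one_tensor(v, lam=2.0)
u = tensor_power_iteration(T, [1.0, 0.0, 0.0], n_steps=3)
```

In this noiseless toy case the iterate aligns with `v` after a single step; the number of steps needed once noise is present is the kind of quantity the abstract's bounds address.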
On the Intrinsic Structures of Spiking Neural Networks
http://jmlr.org/papers/v25/23-1526.html
http://jmlr.org/papers/volume25/23-1526/23-1526.pdf
2024Shao-Qun Zhang, Jia-Yi Chen, Jin-Hui Wu, Gao Zhang, Huan Xiong, Bin Gu, Zhi-Hua Zhou
Recent years have seen a surge of interest in spiking neural networks (SNNs). The performance of SNNs hinges not only on searching apposite architectures and connection weights, similar to conventional artificial neural networks, but also on the meticulous configuration of their intrinsic structures. However, there has been a dearth of comprehensive studies examining the impact of intrinsic structures; thus developers often find it challenging to apply a standardized configuration of SNNs across diverse datasets or tasks. This work delves deep into the intrinsic structures of SNNs. Initially, we draw two key conclusions: (1) the membrane time hyper-parameter is intimately linked to the eigenvalues of the integration operation, dictating the functional topology of spiking dynamics; (2) various hyper-parameters of the firing-reset mechanism govern the overall firing capacity of an SNN, mitigating the injection ratio or sampling density of input data. These findings elucidate why the efficacy of SNNs hinges heavily on the configuration of intrinsic structures and lead to a recommendation that enhancing the adaptability of these structures contributes to improving the overall performance and applicability of SNNs. Inspired by this recognition, we propose two feasible approaches to enhance SNN learning, involving developing self-connection architectures and stochastic spiking neurons to augment the adaptability of the integration operation and firing-reset mechanism, respectively. We theoretically prove that (1) both methods promote the expressive property for universal approximation, (2) the incorporation of self-connection architectures fosters ample solutions and structural stability for SNNs approximating adaptive dynamical systems, (3) the stochastic spiking neurons maintain generalization bounds with an exponential reduction in Rademacher complexity. Empirical experiments conducted on various real-world datasets affirm the effectiveness of our proposed methods.
Three-Way Trade-Off in Multi-Objective Learning: Optimization, Generalization and Conflict-Avoidance
http://jmlr.org/papers/v25/23-1287.html
http://jmlr.org/papers/volume25/23-1287/23-1287.pdf
2024Lisha Chen, Heshan Fernando, Yiming Ying, Tianyi Chen
Multi-objective learning (MOL) often arises in machine learning problems when there are multiple data modalities or tasks. One critical challenge in MOL is the potential conflict among different objectives during the optimization process. Recent works have developed various dynamic weighting algorithms for MOL, where the central idea is to find an update direction that avoids conflicts among objectives. Despite its appealing intuition, empirical studies show that dynamic weighting methods may not outperform static ones. To understand this theory-practice gap, we focus on a stochastic variant of MGDA, the Multi-objective gradient with Double sampling (MoDo), and study the generalization performance and its interplay with optimization through the lens of algorithmic stability in the framework of statistical learning theory. We find that the key rationale behind MGDA, updating along a conflict-avoidant direction, may hinder dynamic weighting algorithms from achieving the optimal $O(1/\sqrt{n})$ population risk, where $n$ is the number of training samples. We further demonstrate the impact of dynamic weights on the three-way trade-off among optimization, generalization, and conflict avoidance unique in MOL. We showcase the generality of our theoretical framework by analyzing other algorithms under the framework. Experiments on various multi-task learning benchmarks are performed to demonstrate the practical applicability. Code is available at https://github.com/heshandevaka/Trade-Off-MOL.
Neural Collapse for Unconstrained Feature Model under Cross-entropy Loss with Imbalanced Data
http://jmlr.org/papers/v25/23-1215.html
http://jmlr.org/papers/volume25/23-1215/23-1215.pdf
2024Wanli Hong, Shuyang Ling
Neural Collapse (NC) is a fascinating phenomenon that arises during the terminal phase of training (TPT) of deep neural networks (DNNs). Specifically, for balanced training datasets (each class shares the same number of samples), it is observed that the feature vectors of samples from the same class converge to their corresponding in-class mean features and their pairwise angles are the same. In this paper, we study the extension of the NC phenomenon to imbalanced datasets under cross-entropy loss function in the context of the unconstrained feature model (UFM). Our contribution is multi-fold compared with the state-of-the-art results: (a) we show that the feature vectors within the same class still collapse to the same mean vector; (b) the mean feature vectors no longer share the same pairwise angle. Instead, those angles depend on sample sizes; (c) we also characterize the sharp threshold on which the minority collapse (the feature vectors of the minority groups collapse to one single vector) will happen; (d) finally, we argue that the effect of the imbalance in datasets diminishes as the sample size grows. Our results provide a complete picture of the NC under the cross-entropy loss for imbalanced datasets. Numerical experiments confirm our theories.
Generalized Independent Noise Condition for Estimating Causal Structure with Latent Variables
http://jmlr.org/papers/v25/23-1052.html
http://jmlr.org/papers/volume25/23-1052/23-1052.pdf
2024Feng Xie, Biwei Huang, Zhengming Chen, Ruichu Cai, Clark Glymour, Zhi Geng, Kun Zhang
We investigate the challenging task of learning causal structure in the presence of latent variables, including locating latent variables, determining their quantity, and identifying causal relationships among both latent and observed variables. To address this, we propose a Generalized Independent Noise (GIN) condition for linear non-Gaussian acyclic causal models that incorporate latent variables, which establishes the independence between a linear combination of certain measured variables and some other measured variables. Specifically, for two observed random vectors $\mathbf{Y}$ and $\mathbf{Z}$, GIN holds if and only if $\omega^{\intercal}\mathbf{Y}$ and $\mathbf{Z}$ are statistically independent, where $\omega$ is a non-zero parameter vector determined by the cross-covariance between $\mathbf{Y}$ and $\mathbf{Z}$. We then give necessary and sufficient graphical criteria of the GIN condition in linear non-Gaussian acyclic causal models. From a graphical perspective, roughly speaking, GIN implies the existence of a set $\mathcal{S}$ such that $\mathcal{S}$ is causally earlier (w.r.t. the causal ordering) than $\mathbf{Y}$, and that every active (collider-free) path between $\mathbf{Y}$ and $\mathbf{Z}$ must contain a node from $\mathcal{S}$. Interestingly, we find that the independent noise condition (i.e., if there is no confounder, causes are independent of the residual derived from regressing the effect on the causes) can be seen as a special case of GIN. With such a connection between GIN and latent causal structures, we further leverage the proposed GIN condition, together with a well-designed search procedure, to efficiently estimate Linear, Non-Gaussian Latent Hierarchical Models (LiNGLaHs), where latent confounders may also be causally related and may even follow a hierarchical structure. We show that the underlying causal structure of a LiNGLaH is identifiable in light of GIN conditions under mild assumptions.
Experimental results on both synthetic and three real-world data sets show the effectiveness of the proposed approach.
Classification of Data Generated by Gaussian Mixture Models Using Deep ReLU Networks
http://jmlr.org/papers/v25/23-0957.html
http://jmlr.org/papers/volume25/23-0957/23-0957.pdf
2024Tian-Yi Zhou, Xiaoming Huo
This paper studies the binary classification of unbounded data from ${\mathbb R}^d$ generated under Gaussian Mixture Models (GMMs) using deep ReLU neural networks. We obtain — for the first time — non-asymptotic upper bounds and convergence rates of the excess risk (excess misclassification error) for the classification without restrictions on model parameters. While the majority of existing generalization analysis of classification algorithms relies on a bounded domain, we consider an unbounded domain by leveraging the analyticity and fast decay of Gaussian distributions. To facilitate our analysis, we give a novel approximation error bound for general analytic functions using ReLU networks, which may be of independent interest. Gaussian distributions can be adopted nicely to model data arising in applications, e.g., speeches, images, and texts; our results provide a theoretical verification of the observed efficiency of deep neural networks in practical classification problems.
Differentially Private Topological Data Analysis
http://jmlr.org/papers/v25/23-0585.html
http://jmlr.org/papers/volume25/23-0585/23-0585.pdf
2024Taegyu Kang, Sehwan Kim, Jinwon Sohn, Jordan Awan
This paper is the first to attempt differentially private (DP) topological data analysis (TDA), producing near-optimal private persistence diagrams. We analyze the sensitivity of persistence diagrams in terms of the bottleneck distance, and we show that the commonly used Čech complex has sensitivity that does not decrease as the sample size $n$ increases. This makes it challenging for the persistence diagrams of Čech complexes to be privatized. As an alternative, we show that the persistence diagram obtained by the $L^1$-distance to measure (DTM) has sensitivity $O(1/n)$. Based on the sensitivity analysis, we propose using the exponential mechanism whose utility function is defined in terms of the bottleneck distance of the $L^1$-DTM persistence diagrams. We also derive upper and lower bounds of the accuracy of our privacy mechanism; the obtained bounds indicate that the privacy error of our mechanism is near-optimal. We demonstrate the performance of our privatized persistence diagrams through simulations as well as on a real data set tracking human movement.
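As a rough illustration of the exponential mechanism mentioned in the abstract above, the following Python sketch selects one of several candidate outputs with probability proportional to $\exp(\epsilon u / (2\Delta))$, where $u$ is a candidate's utility and $\Delta$ is the sensitivity of the utility. The function names and the toy utilities (made-up numbers standing in for negative bottleneck distances) are assumptions for illustration only, not the authors' implementation.

```python
import math
import random


def exponential_mechanism_probs(utilities, epsilon, sensitivity):
    """Selection probabilities proportional to exp(epsilon * u / (2 * sensitivity))."""
    scale = epsilon / (2.0 * sensitivity)
    max_u = max(utilities)  # subtract the max for numerical stability
    weights = [math.exp(scale * (u - max_u)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def sample_candidate(utilities, epsilon, sensitivity, rng):
    """Draw one candidate index according to the exponential mechanism."""
    probs = exponential_mechanism_probs(utilities, epsilon, sensitivity)
    return rng.choices(range(len(utilities)), weights=probs, k=1)[0]


# Hypothetical utilities: negative distances of three candidates to the
# non-private output (larger is better). These numbers are illustrative only.
toy_utilities = [-0.1, -0.5, -2.0]
probs = exponential_mechanism_probs(toy_utilities, epsilon=1.0, sensitivity=0.05)
chosen = sample_candidate(toy_utilities, epsilon=1.0, sensitivity=0.05,
                          rng=random.Random(0))
```

In this sketch a smaller sensitivity makes the selection concentrate more sharply on high-utility candidates, which is why a sensitivity that shrinks with the sample size is helpful for privatization.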
On the Optimality of Misspecified Spectral Algorithms
http://jmlr.org/papers/v25/23-0383.html
http://jmlr.org/papers/volume25/23-0383/23-0383.pdf
2024Haobo Zhang, Yicheng Li, Qian Lin
In the misspecified spectral algorithms problem, researchers usually assume the underlying true function $f_{\rho}^{*} \in [\mathcal{H}]^{s}$, a less-smooth interpolation space of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ for some $s\in (0,1)$. The existing minimax optimal results require $\|f_{\rho}^{*}\|_{L^{\infty}}<\infty$, which implicitly requires $s > \alpha_{0}$, where $\alpha_{0}\in (0,1)$ is the embedding index, a constant depending on $\mathcal{H}$. Whether the spectral algorithms are optimal for all $s\in (0,1)$ has been an outstanding open problem for years. In this paper, we show that spectral algorithms are minimax optimal for any $\alpha_{0}-\frac{1}{\beta} < s < 1$, where $\beta$ is the eigenvalue decay rate of $\mathcal{H}$. We also give several classes of RKHSs whose embedding index satisfies $\alpha_0 = \frac{1}{\beta}$. Thus, the spectral algorithms are minimax optimal for all $s\in (0,1)$ on these RKHSs.
An Entropy-Based Model for Hierarchical Learning
http://jmlr.org/papers/v25/23-0096.html
http://jmlr.org/papers/volume25/23-0096/23-0096.pdf
2024Amir R. Asadi
Machine learning, the predominant approach in the field of artificial intelligence, enables computers to learn from data and experience. In the supervised learning framework, accurate and efficient learning of dependencies between data instances and their corresponding labels requires auxiliary information about the data distribution and the target function. This central concept aligns with the notion of regularization in statistical learning theory. Real-world datasets are often characterized by multiscale data instance distributions and well-behaved, smooth target functions. Scale-invariant probability distributions, such as power-law distributions, provide notable examples of multiscale data instance distributions in various contexts. This paper introduces a hierarchical learning model that leverages such a multiscale data structure with a multiscale entropy-based training procedure and explores its statistical and computational advantages. The hierarchical learning model is inspired by the logical progression in human learning from easy to complex tasks and features interpretable levels. In this model, the logarithm of any data instance’s norm can be construed as the data instance's complexity, and the allocation of computational resources is tailored to this complexity, resulting in benefits such as increased inference speed. Furthermore, our multiscale analysis of the statistical risk yields stronger guarantees compared to conventional uniform convergence bounds.
Optimal Clustering with Bandit Feedback
http://jmlr.org/papers/v25/22-1088.html
http://jmlr.org/papers/volume25/22-1088/22-1088.pdf
2024Junwen Yang, Zixin Zhong, Vincent Y. F. Tan
This paper considers the problem of online clustering with bandit feedback. A set of arms (or items) can be partitioned into various groups that are unknown. Within each group, the observations associated to each of the arms follow the same distribution with the same mean vector. At each time step, the agent queries or pulls an arm and obtains an independent observation from the distribution it is associated to. Subsequent pulls depend on previous ones as well as the previously obtained samples. The agent's task is to uncover the underlying partition of the arms with the least number of arm pulls and with a probability of error not exceeding a prescribed constant $\delta$. The problem proposed finds numerous applications from clustering of variants of viruses to online market segmentation. We present an instance-dependent information-theoretic lower bound on the expected sample complexity for this task, and design a computationally efficient and asymptotically optimal algorithm, namely Bandit Online Clustering (BOC). The algorithm includes a novel stopping rule for adaptive sequential testing that circumvents the need to exactly solve any NP-hard weighted clustering problem as its subroutines. We show through extensive simulations on synthetic and real-world datasets that BOC's performance matches the lower bound asymptotically, and significantly outperforms a non-adaptive baseline algorithm.
A flexible empirical Bayes approach to multiple linear regression and connections with penalized regression
http://jmlr.org/papers/v25/22-0953.html
http://jmlr.org/papers/volume25/22-0953/22-0953.pdf
2024Youngseok Kim, Wei Wang, Peter Carbonetto, Matthew Stephens
We introduce a new empirical Bayes approach for large-scale multiple linear regression. Our approach combines two key ideas: (i) the use of flexible "adaptive shrinkage" priors, which approximate the nonparametric family of scale mixture of normal distributions by a finite mixture of normal distributions; and (ii) the use of variational approximations to efficiently estimate prior hyperparameters and compute approximate posteriors. Combining these two ideas results in fast and flexible methods, with computational speed comparable to fast penalized regression methods such as the Lasso, and with competitive prediction accuracy across a wide range of scenarios. Further, we provide new results that establish conceptual connections between our empirical Bayes methods and penalized methods. Specifically, we show that the posterior mean from our method solves a penalized regression problem, with the form of the penalty function being learned from the data by directly solving an optimization problem (rather than being tuned by cross-validation). Our methods are implemented in an R package, mr.ash.alpha, available from https://github.com/stephenslab/mr.ash.alpha.
Spectral Analysis of the Neural Tangent Kernel for Deep Residual Networks
http://jmlr.org/papers/v25/22-0597.html
http://jmlr.org/papers/volume25/22-0597/22-0597.pdf
2024Yuval Belfer, Amnon Geifman, Meirav Galun, Ronen Basri
Deep residual network architectures have been shown to achieve superior accuracy over classical feed-forward networks, yet their success is still not fully understood. Focusing on massively over-parameterized, fully connected residual networks with ReLU activation through their respective neural tangent kernels (ResNTK), we provide here a spectral analysis of these kernels. Specifically, we show that, much like NTK for fully connected networks (FC-NTK), for input distributed uniformly on the hypersphere $S^d$, the eigenvalues of ResNTK corresponding to their spherical harmonics eigenfunctions decay polynomially with frequency $k$ as $k^{-d}$. These in turn imply that the set of functions in their Reproducing Kernel Hilbert Space is identical to that of both FC-NTK and the standard Laplace kernel. Our spectral analysis allows us to highlight several additional properties of ResNTK, which depend on the choice of a hyper-parameter that balances between the skip and residual connections. Specifically, (1) with no bias, deep ResNTK is significantly biased toward even frequency functions; (2) unlike FC-NTK for deep networks, which is spiky and therefore yields poor generalization, ResNTK is stable and yields small generalization errors. Finally, we demonstrate these properties with experiments, which further show that these phenomena arise in real networks.
Permuted and Unlinked Monotone Regression in R^d: an approach based on mixture modeling and optimal transport
http://jmlr.org/papers/v25/22-0058.html
http://jmlr.org/papers/volume25/22-0058/22-0058.pdf
2024Martin Slawski, Bodhisattva Sen
Suppose that we have a regression problem with response variable $Y \in \mathbb{R}^d$ and predictor $X \in \mathbb{R}^d$, for $d \ge 1$. In permuted or unlinked regression we have access to separate unordered data on $X$ and $Y$, as opposed to data on $(X,Y)$-pairs in usual regression. So far in the literature the case $d=1$ has received attention, see e.g., the recent papers by Rigollet and Weed [Information & Inference, 8, 619-717] and Balabdaoui et al. [J. Mach. Learn. Res., 22 (172), 1-60]. In this paper, we consider the general multivariate setting with $d \geq 1$. We show that the notion of cyclical monotonicity of the regression function is sufficient for identification and estimation in the permuted/unlinked regression model. We study permutation recovery in the permuted regression setting and develop a computationally efficient and easy-to-use algorithm for denoising based on the Kiefer-Wolfowitz [Ann. Math. Statist., 27, 887-906] nonparametric maximum likelihood estimator and techniques from the theory of optimal transport. We provide explicit upper bounds on the associated mean squared denoising error for Gaussian noise. As in previous work on the case $d = 1$, the permuted/unlinked setting involves slow (logarithmic) rates of convergence rooted in the underlying deconvolution problem. We also provide an extension to a certain class of elliptic noise distributions that includes a multivariate generalization of the Laplace distribution, for which polynomial rates can be obtained. Numerical studies complement our theoretical analysis and show that the proposed approach performs at least on par with the methods in the aforementioned prior work in the case $d = 1$ while achieving substantial reductions in terms of computational complexity.
Volterra Neural Networks (VNNs)
http://jmlr.org/papers/v25/21-1082.html
http://jmlr.org/papers/volume25/21-1082/21-1082.pdf
2024Siddharth Roheda, Hamid Krim, Bo Jiang
The importance of inference in Machine Learning (ML) has led to an explosive number of different proposals, particularly in Deep Learning. In an attempt to reduce the complexity of Convolutional Neural Networks, we propose a Volterra filter-inspired Network architecture. This architecture introduces controlled non-linearities in the form of interactions between the delayed input samples of data. We propose a cascaded implementation of Volterra Filtering so as to significantly reduce the number of parameters required to carry out the same classification task as that of a conventional Neural Network. We demonstrate an efficient parallel implementation of this Volterra Neural Network (VNN), along with its remarkable performance while retaining a relatively simpler and potentially more tractable structure. Furthermore, we show a rather sophisticated adaptation of this network to nonlinearly fuse the RGB (spatial) information and the Optical Flow (temporal) information of a video sequence for action recognition. The proposed approach is evaluated on UCF-101 and HMDB-51 datasets for action recognition, and is shown to outperform state of the art CNN approaches.
Towards Optimal Sobolev Norm Rates for the Vector-Valued Regularized Least-Squares Algorithm
http://jmlr.org/papers/v25/23-1663.html
http://jmlr.org/papers/volume25/23-1663/23-1663.pdf
2024Zhu Li, Dimitri Meunier, Mattes Mollenhauer, Arthur Gretton
We present the first optimal rates for infinite-dimensional vector-valued ridge regression on a continuous scale of norms that interpolate between $L_2$ and the hypothesis space, which we consider as a vector-valued reproducing kernel Hilbert space. These rates allow us to treat the misspecified case in which the true regression function is not contained in the hypothesis space. We combine standard assumptions on the capacity of the hypothesis space with a novel tensor product construction of vector-valued interpolation spaces in order to characterize the smoothness of the regression function. Our upper bound not only attains the same rate as real-valued kernel ridge regression, but also removes the assumption that the target regression function is bounded. For the lower bound, we reduce the problem to the scalar setting using a projection argument. We show that these rates are optimal in most cases and independent of the dimension of the output space. We illustrate our results for the special case of vector-valued Sobolev spaces.
Bayesian Regression Markets
http://jmlr.org/papers/v25/23-1385.html
http://jmlr.org/papers/volume25/23-1385/23-1385.pdf
2024Thomas Falconer, Jalal Kazempour, Pierre Pinson
Although machine learning tasks are highly sensitive to the quality of input data, relevant datasets can often be challenging for firms to acquire, especially when held privately by a variety of owners. For instance, if these owners are competitors in a downstream market, they may be reluctant to share information. Focusing on supervised learning for regression tasks, we develop a regression market to provide a monetary incentive for data sharing. Our mechanism adopts a Bayesian framework, allowing us to consider a more general class of regression tasks. We present a thorough exploration of the market properties, and show that similar proposals in the literature expose the market agents to sizeable financial risks, which can be mitigated in our setup.
Sharpness-Aware Minimization and the Edge of Stability
http://jmlr.org/papers/v25/23-1285.html
http://jmlr.org/papers/volume25/23-1285/23-1285.pdf
2024Philip M. Long, Peter L. Bartlett
Recent experiments have shown that, often, when training a neural network with gradient descent (GD) with a step size $\eta$, the operator norm of the Hessian of the loss grows until it approximately reaches $2/\eta$, after which it fluctuates around this value. The quantity $2/\eta$ has been called the “edge of stability” based on consideration of a local quadratic approximation of the loss. We perform a similar calculation to arrive at an “edge of stability” for Sharpness-Aware Minimization (SAM), a variant of GD which has been shown to improve its generalization. Unlike the case for GD, the resulting SAM-edge depends on the norm of the gradient. Using three deep learning training tasks, we see empirically that SAM operates on the edge of stability identified by this analysis.
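The $2/\eta$ threshold for plain gradient descent, which the abstract above attributes to a local quadratic approximation of the loss, can be checked on a one-dimensional quadratic. The Python sketch below is a toy illustration under that assumption (the function name and the chosen curvatures are hypothetical); it does not reproduce the SAM-edge derived in the paper.

```python
def gradient_descent_on_quadratic(curvature, step_size, x0=1.0, num_steps=50):
    """Run GD on f(x) = 0.5 * curvature * x**2 and return the final |x|."""
    x = x0
    for _ in range(num_steps):
        grad = curvature * x        # f'(x) = curvature * x
        x = x - step_size * grad    # equivalent to x <- (1 - step_size * curvature) * x
    return abs(x)


step_size = 0.1
threshold = 2.0 / step_size         # classical stability threshold 2 / eta

# Curvature slightly below the threshold: the iterates contract toward 0.
below = gradient_descent_on_quadratic(0.9 * threshold, step_size)
# Curvature slightly above the threshold: the iterates grow in magnitude.
above = gradient_descent_on_quadratic(1.1 * threshold, step_size)
```

Each step multiplies the iterate by $1 - \eta a$ for curvature $a$, so the iteration contracts exactly when $a < 2/\eta$.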
Optimistic Online Mirror Descent for Bridging Stochastic and Adversarial Online Convex Optimization
http://jmlr.org/papers/v25/23-1072.html
http://jmlr.org/papers/volume25/23-1072/23-1072.pdf
2024Sijia Chen, Yu-Jie Zhang, Wei-Wei Tu, Peng Zhao, Lijun Zhang
The stochastically extended adversarial (SEA) model, introduced by Sachs et al. (2022), serves as an interpolation between stochastic and adversarial online convex optimization. Under the smoothness condition on expected loss functions, it is shown that the expected static regret of optimistic follow-the-regularized-leader (FTRL) depends on the cumulative stochastic variance $\sigma_{1:T}^2$ and the cumulative adversarial variation $\Sigma_{1:T}^2$ for convex functions. Sachs et al. (2022) also provide a regret bound based on the maximal stochastic variance $\sigma_{\max}^2$ and the maximal adversarial variation $\Sigma_{\max}^2$ for strongly convex functions. Inspired by their work, we investigate the theoretical guarantees of optimistic online mirror descent (OMD) for the SEA model with smooth expected loss functions. For convex and smooth functions, we obtain the same $\mathcal{O}(\sqrt{\sigma_{1:T}^2}+\sqrt{\Sigma_{1:T}^2})$ regret bound, but with a relaxation of the convexity requirement from individual functions to expected functions. For strongly convex and smooth functions, we establish an $\mathcal{O}\left(\frac{1}{\lambda}\left(\sigma_{\max}^2+\Sigma_{\max}^2\right)\log \left(\left(\sigma_{1:T}^2 + \Sigma_{1:T}^2\right)/\left(\sigma_{\max}^2+\Sigma_{\max}^2\right)\right)\right)$ bound, better than their $\mathcal{O}((\sigma_{\max}^2 + \Sigma_{\max}^2) \log T)$ result. For exp-concave and smooth functions, our approach yields a new $\mathcal{O}(d\log(\sigma_{1:T}^2+\Sigma_{1:T}^2))$ bound. Moreover, we introduce the first expected dynamic regret guarantee for the SEA model with convex and smooth expected functions, which is more favorable than static regret bounds in non-stationary environments. Furthermore, we expand our investigation to scenarios with non-smooth expected loss functions and propose novel algorithms built upon optimistic OMD with an implicit update, successfully attaining both static and dynamic regret guarantees.
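For readers unfamiliar with optimistic OMD, the Python sketch below shows its simplest Euclidean instance on an interval: the learner plays a point obtained from an auxiliary iterate using a hint (here, the previous gradient), then updates the auxiliary iterate with the observed gradient. The fixed step size, the choice of hint, the interval bounds, and all names are assumptions made for illustration; the step-size schedules and analysis of the paper are not reproduced.

```python
def project(x, lower, upper):
    """Euclidean projection of a scalar onto [lower, upper]."""
    return max(lower, min(upper, x))


def optimistic_ogd(grad_fn, num_rounds, step_size,
                   lower=-5.0, upper=5.0, x_init=0.0):
    """Optimistic online gradient descent (Euclidean mirror map) on an interval.

    grad_fn(t, x) returns the gradient of the round-t loss at x.
    The hint for each round is the gradient observed in the previous round.
    """
    x_hat = x_init   # auxiliary iterate
    hint = 0.0       # no information is available before the first round
    decisions = []
    for t in range(num_rounds):
        # Optimistic step: move from the auxiliary iterate along the hint.
        x = project(x_hat - step_size * hint, lower, upper)
        # Observe the gradient of the current loss at the played point.
        g = grad_fn(t, x)
        # Update the auxiliary iterate with the observed gradient.
        x_hat = project(x_hat - step_size * g, lower, upper)
        hint = g
        decisions.append(x)
    return decisions


# Toy usage: the same loss (x - 1)**2 in every round, with gradient 2 * (x - 1).
decisions = optimistic_ogd(lambda t, x: 2.0 * (x - 1.0),
                           num_rounds=200, step_size=0.1)
```

With a fixed loss the hint becomes increasingly accurate, and the played points settle at the minimizer $x = 1$.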
Multi-Objective Neural Architecture Search by Learning Search Space Partitions
http://jmlr.org/papers/v25/23-1013.html
http://jmlr.org/papers/volume25/23-1013/23-1013.pdf
2024Yiyang Zhao, Linnan Wang, Tian Guo
Deploying deep learning models requires taking into consideration neural network metrics such as model size, inference latency, and #FLOPs, aside from inference accuracy. As a result, model designers use multi-objective optimization to design deep neural networks that perform well across multiple criteria. However, applying multi-objective optimization to neural architecture search (NAS) is nontrivial because NAS tasks usually have a huge search space, along with a non-negligible search cost. This requires effective multi-objective search algorithms to alleviate the GPU costs. In this work, we implement a novel multi-objective optimizer based on a recently proposed meta-algorithm called LaMOO on NAS tasks. In a nutshell, LaMOO speeds up the search process by learning a model from observed samples to partition the search space and then focusing on promising regions likely to contain a subset of the Pareto frontier. Using LaMOO, we observe an improvement of more than 200% in sample efficiency compared to Bayesian optimization and evolutionary-based multi-objective optimizers on different NAS datasets. For example, when combined with LaMOO, qEHVI achieves a 225% improvement in sample efficiency compared to using qEHVI alone in NasBench201. For real-world tasks, LaMOO achieves 97.36% accuracy with only 1.62M #Params on CIFAR10 in only 600 search samples. On ImageNet, our large model reaches 80.4% top-1 accuracy with only 522M #FLOPs.
Fermat Distances: Metric Approximation, Spectral Convergence, and Clustering Algorithms
http://jmlr.org/papers/v25/23-0939.html
http://jmlr.org/papers/volume25/23-0939/23-0939.pdf
2024Nicolás García Trillos, Anna Little, Daniel McKenzie, James M. Murphy
We analyze the convergence properties of Fermat distances, a family of density-driven metrics defined on Riemannian manifolds with an associated probability measure. Fermat distances may be defined either on discrete samples from the underlying measure, in which case they are random, or in the continuum setting, where they are induced by geodesics under a density-distorted Riemannian metric. We prove that discrete, sample-based Fermat distances converge to their continuum analogues in small neighborhoods with a precise rate that depends on the intrinsic dimensionality of the data and the parameter governing the extent of density weighting in Fermat distances. This is done by leveraging novel geometric and statistical arguments in percolation theory that allow for non-uniform densities and curved domains. Our results are then used to prove that discrete graph Laplacians based on discrete, sample-driven Fermat distances converge to corresponding continuum operators. In particular, we show the discrete eigenvalues and eigenvectors converge to their continuum analogues at a dimension-dependent rate, which allows us to interpret the efficacy of discrete spectral clustering using Fermat distances in terms of the resulting continuum limit. The perspective afforded by our discrete-to-continuum Fermat distance analysis leads to new clustering algorithms for data and related insights into efficient computations associated to density-driven spectral clustering. Our theoretical analysis is supported with numerical simulations and experiments on synthetic and real image data.
Spherical Rotation Dimension Reduction with Geometric Loss Functions
http://jmlr.org/papers/v25/23-0547.html
http://jmlr.org/papers/volume25/23-0547/23-0547.pdf
2024Hengrui Luo, Jeremy E. Purvis, Didong Li
Modern datasets often exhibit high dimensionality, yet the data reside in low-dimensional manifolds that can reveal underlying geometric structures critical for data analysis. A prime example of such a dataset is a collection of cell cycle measurements, where the inherently cyclical nature of the process can be represented as a circle or sphere. Motivated by the need to analyze these types of datasets, we propose a nonlinear dimension reduction method, Spherical Rotation Component Analysis (SRCA), that incorporates geometric information to better approximate low-dimensional manifolds. SRCA is a versatile method designed to work in both high-dimensional and small sample size settings. By employing spheres or ellipsoids, SRCA provides a low-rank spherical representation of the data with general theoretic guarantees, effectively retaining the geometric structure of the dataset during dimensionality reduction. A comprehensive simulation study, along with a successful application to human cell cycle data, further highlights the advantages of SRCA compared to state-of-the-art alternatives, demonstrating its superior performance in approximating the manifold while preserving inherent geometric structures.
A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks
http://jmlr.org/papers/v25/23-0137.html
http://jmlr.org/papers/volume25/23-0137/23-0137.pdf
2024Yuxin Sun, Dong Lao, Anthony Yezzi, Ganesh Sundaramoorthi
We discover restrained numerical instabilities in current training practices of deep networks with stochastic gradient descent (SGD) and its variants. We show that numerical error (on the order of the smallest floating point bit and thus the most extreme or limiting numerical perturbations induced from floating point arithmetic in training deep nets) can be amplified significantly and result in significant test accuracy variance (sensitivities), comparable to the test accuracy variance due to stochasticity in SGD. We show how this is likely traced to instabilities of the optimization dynamics that are restrained, i.e., localized over iterations and regions of the weight tensor space. We do this by presenting a theoretical framework using numerical analysis of partial differential equations (PDE), and analyzing the gradient descent PDE of convolutional neural networks (CNNs). We show that it is stable only under certain conditions on the learning rate and weight decay. We show that rather than blowing up when the conditions are violated, the instability can be restrained. We show this is a consequence of the non-linear PDE associated with the gradient descent of the CNN, whose local linearization changes when over-driving the step size of the discretization, resulting in a stabilizing effect. We link restrained instabilities to the recently discovered Edge of Stability (EoS) phenomenon, in which the stable step size predicted by classical theory is exceeded while continuing to optimize the loss and still converging. Because restrained instabilities occur at the EoS, our theory provides new insights and predictions about the EoS, in particular, the role of regularization and the dependence on the network complexity.
Two is Better Than One: Regularized Shrinkage of Large Minimum Variance Portfolios
http://jmlr.org/papers/v25/22-1337.html
http://jmlr.org/papers/volume25/22-1337/22-1337.pdf
2024Taras Bodnar, Nestor Parolya, Erik Thorsen
In this paper, we construct a shrinkage estimator of the global minimum variance (GMV) portfolio by combining two techniques: Tikhonov regularization and direct shrinkage of portfolio weights. More specifically, we employ a double shrinkage approach, where the covariance matrix and portfolio weights are shrunk simultaneously. The ridge parameter controls the stability of the covariance matrix, while the portfolio shrinkage intensity shrinks the regularized portfolio weights to a predefined target. Both parameters simultaneously minimize, with probability one, the out-of-sample variance as the number of assets $p$ and the sample size $n$ tend to infinity, while their ratio $p/n$ tends to a constant $c > 0$. This method can also be seen as the optimal combination of the well-established linear shrinkage approach of Ledoit and Wolf (2004) and the shrinkage of the portfolio weights by Bodnar, Parolya and Schmid (2018). No specific distribution is assumed for the asset returns, except for the assumption of finite moments of order $4 + \varepsilon$ for $\varepsilon > 0$. The performance of the double shrinkage estimator is investigated via extensive simulation and empirical studies. The suggested method significantly outperforms its predecessor (without regularization) and the nonlinear shrinkage approach in terms of the out-of-sample variance, Sharpe ratio, and other empirical measures in the majority of scenarios. Moreover, it maintains the most stable portfolio weights with uniformly smallest turnover.
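The double shrinkage idea described above can be sketched for two assets in plain Python: first add a ridge term to the sample covariance matrix and compute the GMV weights $w = \Sigma^{-1}\mathbf{1} / (\mathbf{1}^\top \Sigma^{-1} \mathbf{1})$, then shrink those weights linearly toward a target portfolio. The additive ridge parametrization, the hand-picked values of the ridge parameter and shrinkage intensity, and the function names are assumptions for illustration; the paper chooses both parameters optimally and may parametrize the regularization differently.

```python
def gmv_weights_2x2(cov):
    """GMV weights w = inv(cov) @ 1 / (1' inv(cov) 1) for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    row_sums = [inv[0][0] + inv[0][1], inv[1][0] + inv[1][1]]  # inv(cov) @ ones
    total = sum(row_sums)
    return [r / total for r in row_sums]


def double_shrinkage_weights(sample_cov, ridge, intensity, target):
    """Ridge-regularize the covariance, then shrink the weights toward a target."""
    regularized = [
        [sample_cov[i][j] + (ridge if i == j else 0.0) for j in range(2)]
        for i in range(2)
    ]
    w = gmv_weights_2x2(regularized)
    # Convex combination of the regularized GMV weights and the target portfolio.
    return [intensity * wi + (1.0 - intensity) * bi for wi, bi in zip(w, target)]


# Illustrative sample covariance of two assets and an equally weighted target.
sample_cov = [[0.04, 0.01], [0.01, 0.09]]
target = [0.5, 0.5]

plain = gmv_weights_2x2(sample_cov)
shrunk = double_shrinkage_weights(sample_cov, ridge=0.02, intensity=0.7,
                                  target=target)
```

Because both the GMV weights and the target sum to one, the shrunk weights remain a valid fully invested portfolio for any intensity.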
Decentralized Natural Policy Gradient with Variance Reduction for Collaborative Multi-Agent Reinforcement Learning
http://jmlr.org/papers/v25/22-1036.html
http://jmlr.org/papers/volume25/22-1036/22-1036.pdf
2024Jinchi Chen, Jie Feng, Weiguo Gao, Ke Wei
This paper studies a policy optimization problem arising from collaborative multi-agent reinforcement learning in a decentralized setting where agents communicate with their neighbors over an undirected graph to maximize the sum of their cumulative rewards. A novel decentralized natural policy gradient method, dubbed Momentum-based Decentralized Natural Policy Gradient (MDNPG), is proposed, which incorporates natural gradient, momentum-based variance reduction, and gradient tracking into the decentralized stochastic gradient ascent framework. The $\mathcal{O}(n^{-1}\epsilon^{-3})$ sample complexity for MDNPG to converge to an $\epsilon$-stationary point has been established under standard assumptions, where $n$ is the number of agents. It indicates that MDNPG can achieve the optimal convergence rate for decentralized policy gradient methods and possesses a linear speedup in contrast to centralized optimization methods. Moreover, superior empirical performance of MDNPG over other state-of-the-art algorithms has been demonstrated by extensive numerical experiments.
Log Barriers for Safe Black-box Optimization with Application to Safe Reinforcement Learning
http://jmlr.org/papers/v25/22-0878.html
http://jmlr.org/papers/volume25/22-0878/22-0878.pdf
2024Ilnura Usmanova, Yarden As, Maryam Kamgarpour, Andreas Krause
Optimizing noisy functions online, when evaluating the objective requires experiments on a deployed system, is a crucial task arising in manufacturing, robotics and various other domains. Often, constraints on safe inputs are unknown ahead of time, and we only obtain noisy information indicating how close we are to violating the constraints. Yet, safety must be guaranteed at all times, not only for the final output of the algorithm. We introduce a general approach for seeking a stationary point in high-dimensional non-linear stochastic optimization problems in which maintaining safety during learning is crucial. Our approach, called LB-SGD, is based on applying stochastic gradient descent (SGD) with a carefully chosen adaptive step size to a logarithmic barrier approximation of the original problem. We provide a complete convergence analysis of non-convex, convex, and strongly-convex smooth constrained problems, with first-order and zeroth-order feedback. Our approach yields efficient updates and scales better with dimensionality compared to existing approaches. We empirically compare the sample complexity and the computational cost of our method with existing safe learning approaches. Beyond synthetic benchmarks, we demonstrate the effectiveness of our approach on minimizing constraint violation in policy search tasks in safe reinforcement learning (RL).
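The log barrier idea behind the approach above can be illustrated on a one-dimensional toy problem: minimize $(x-2)^2$ subject to $x - 1 \le 0$ by running gradient descent on $(x-2)^2 - \eta \log(1 - x)$. The Python sketch below uses exact gradients and a simple step-halving safeguard to keep every iterate strictly feasible; these choices, the toy problem, and all names are assumptions for illustration and are not the adaptive step-size rule or the noisy feedback setting of LB-SGD.

```python
import math


def log_barrier_descent(x0=0.0, barrier_weight=0.01, step_size=0.001,
                        num_steps=5000):
    """Minimize (x - 2)**2 subject to x - 1 <= 0 using a log barrier.

    Barrier objective: (x - 2)**2 - barrier_weight * log(1 - x).
    Returns the list of all iterates, starting from the feasible point x0.
    """
    x = x0
    iterates = [x]
    for _ in range(num_steps):
        # Gradient of the barrier objective at the current feasible point.
        grad = 2.0 * (x - 2.0) + barrier_weight / (1.0 - x)
        step = step_size
        # Safeguard: halve the step until the next point is strictly feasible.
        while x - step * grad >= 1.0:
            step *= 0.5
        x = x - step * grad
        iterates.append(x)
    return iterates


iterates = log_barrier_descent()

# Setting the gradient to zero gives (x - 2) * (x - 1) = barrier_weight / 2,
# whose feasible root is (3 - sqrt(1 + 2 * barrier_weight)) / 2.
stationary_point = (3.0 - math.sqrt(1.0 + 2.0 * 0.01)) / 2.0
```

The barrier term grows without bound as $x$ approaches the constraint boundary, so the iterates approach the constrained optimum $x = 1$ from the inside without ever crossing it.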
Cluster-Adaptive Network A/B Testing: From Randomization to Estimation
http://jmlr.org/papers/v25/22-0192.html
http://jmlr.org/papers/volume25/22-0192/22-0192.pdf
2024Yang Liu, Yifan Zhou, Ping Li, Feifang Hu
The performance of A/B testing in both online and offline experimental settings hinges on mitigating network interference and achieving covariate balancing. These experiments often involve an observable network with identifiable clusters, and measurable cluster-level and individual-level attributes. Exploiting these inherent characteristics holds potential for refining experimental design and subsequent statistical analyses. In this article, we propose a novel cluster-adaptive network A/B testing procedure, which contains a cluster-adaptive randomization (CLAR) and a cluster-adjusted estimator (CAE) to facilitate the design of the experiment and enhance the performance of ATE estimation. The CLAR sequentially assigns clusters to minimize the Mahalanobis distance, which further leads to the balance of the cluster-level covariates and the within-cluster-averaged individual-level covariates. The CAE is tailored to offset biases caused by network interference. The proposed procedure has two desirable properties. First, we show that the Mahalanobis distance calculated for the two levels of covariates is $O_p(m^{-1})$, where $m$ represents the number of clusters. This result justifies the simultaneous balance of the cluster-level and individual-level covariates. Second, under mild conditions, we derive the asymptotic normality of CAE and demonstrate the benefit of covariate balancing on improving the precision for estimating ATE. The proposed A/B testing procedure is easy to calculate, consistent, and achieves higher accuracy. Extensive numerical studies are conducted to demonstrate the finite sample property of the proposed network A/B testing procedure.
On the Computational and Statistical Complexity of Over-parameterized Matrix Sensing
http://jmlr.org/papers/v25/21-1437.html
http://jmlr.org/papers/volume25/21-1437/21-1437.pdf
2024Jiacheng Zhuo, Jeongyeol Kwon, Nhat Ho, Constantine Caramanis
We consider solving the low-rank matrix sensing problem with the Factorized Gradient Descent (FGD) method when the specified rank is larger than the true rank. We refer to this as over-parameterized matrix sensing.
If the ground truth signal $\mathbf{X}^* \in \mathbb{R}^{d \times d}$ is of rank $r$, but we try to recover it using $\mathbf{F} \mathbf{F}^\top$ where $\mathbf{F} \in \mathbb{R}^{d \times k}$ and $k>r$, the existing statistical analysis either no longer holds or produces a vacuous statistical error upper bound (infinity) due to the flat local curvature of the loss function around the global minima.
By decomposing the factorized matrix $\mathbf{F}$ into separate column spaces to capture the impact of using $k > r$, we show that $\left\| {\mathbf{F}_t \mathbf{F}_t^\top - \mathbf{X}^*} \right\|_F^2$ converges sub-linearly to a statistical error of $\tilde{\mathcal{O}} (k d \sigma^2/n)$ after $\tilde{\mathcal{O}}(\frac{\sigma_{r}}{\sigma}\sqrt{\frac{n}{d}})$ iterations, where $\mathbf{F}_t$ is the output of FGD after $t$ iterations, $\sigma^2$ is the variance of the observation noise, $\sigma_{r}$ is the $r$-th largest eigenvalue of $\mathbf{X}^*$, and $n$ is the number of samples.
With a precise characterization of the convergence behavior and the statistical error, our results, therefore, offer a comprehensive picture of the statistical and computational complexity if we solve the over-parameterized matrix sensing problem with FGD.
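The over-parameterized setting described above can be sketched as plain gradient descent on the factor $\mathbf{F}$ with $k > r$. The sketch below is illustrative only: the loss scaling, symmetric Gaussian sensing matrices, small random initialization, step size, and problem sizes are all assumptions rather than the paper's exact setup, and `fgd_matrix_sensing` is a hypothetical name.

```python
import numpy as np

def fgd_matrix_sensing(A, y, k, step=0.1, iters=500, init_scale=1e-2, seed=0):
    """Gradient descent on (1/4n) * sum_i (<A_i, F F^T> - y_i)^2 over F in R^{d x k}."""
    rng = np.random.default_rng(seed)
    n, d, _ = A.shape
    F = init_scale * rng.standard_normal((d, k))
    for _ in range(iters):
        # Residuals <A_i, F F^T> - y_i for all n measurements.
        residual = np.einsum("nij,ij->n", A, F @ F.T) - y
        # For symmetric A_i the gradient is (1/n) * sum_i residual_i * A_i F.
        grad = np.einsum("n,nij->ij", residual, A) @ F / n
        F = F - step * grad
    return F

# Synthetic instance: true rank r = 2, specified rank k = 4 > r.
rng = np.random.default_rng(1)
d, r, k, n, sigma = 10, 2, 4, 600, 0.01
U, _ = np.linalg.qr(rng.standard_normal((d, r)))
X_star = U @ U.T                                   # rank-r PSD ground truth
G = rng.standard_normal((n, d, d))
A = (G + np.transpose(G, (0, 2, 1))) / 2           # symmetric sensing matrices
y = np.einsum("nij,ij->n", A, X_star) + sigma * rng.standard_normal(n)

F = fgd_matrix_sensing(A, y, k)
rel_err = np.linalg.norm(F @ F.T - X_star) / np.linalg.norm(X_star)
```

Here `rel_err` measures $\|\mathbf{F}\mathbf{F}^\top - \mathbf{X}^*\|_F / \|\mathbf{X}^*\|_F$, the quantity whose convergence the abstract characterizes.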
Optimization-based Causal Estimation from Heterogeneous Environments
http://jmlr.org/papers/v25/21-1028.html
http://jmlr.org/papers/volume25/21-1028/21-1028.pdf
2024Mingzhang Yin, Yixin Wang, David M. Blei
This paper presents a new optimization approach to causal estimation. Given data that contains covariates and an outcome, which covariates are causes of the outcome, and what is the strength of the causality? In classical machine learning (ML), the goal of optimization is to maximize predictive accuracy. However, some covariates might exhibit a non-causal association with the outcome. Such spurious associations provide predictive power for classical ML, but they prevent us from causally interpreting the result. This paper proposes CoCo, an optimization algorithm that bridges the gap between pure prediction and causal inference. CoCo leverages the recently-proposed idea of environments, datasets of covariates/response where the causal relationships remain invariant but where the distribution of the covariates changes from environment to environment. Given datasets from multiple environments—and ones that exhibit sufficient heterogeneity—CoCo maximizes an objective for which the only solution is the causal solution. We describe the theoretical foundations of this approach and demonstrate its effectiveness on simulated and real datasets. Compared to classical ML and existing methods, CoCo provides more accurate estimates of the causal model and more accurate predictions under interventions.
Optimal Locally Private Nonparametric Classification with Public Data
http://jmlr.org/papers/v25/23-1563.html
http://jmlr.org/papers/volume25/23-1563/23-1563.pdf
2024Yuheng Ma, Hanfang Yang
In this work, we investigate the problem of public data assisted non-interactive Local Differentially Private (LDP) learning with a focus on non-parametric classification. Under the posterior drift assumption, we derive, for the first time, the minimax optimal convergence rate under the LDP constraint. Then, we present a novel approach, the locally differentially private classification tree, which attains the minimax optimal convergence rate. Furthermore, we design a data-driven pruning procedure that avoids parameter tuning and provides a fast converging estimator. Comprehensive experiments conducted on synthetic and real data sets show the superior performance of our proposed methods. Both our theoretical and experimental findings demonstrate the effectiveness of public data compared to private data, which leads to practical suggestions for prioritizing non-private data collection.
Learning to Warm-Start Fixed-Point Optimization Algorithms
http://jmlr.org/papers/v25/23-1174.html
http://jmlr.org/papers/volume25/23-1174/23-1174.pdf
2024Rajiv Sambharya, Georgina Hall, Brandon Amos, Bartolomeo Stellato
We introduce a machine-learning framework to warm-start fixed-point optimization algorithms. Our architecture consists of a neural network mapping problem parameters to warm starts, followed by a predefined number of fixed-point iterations. We propose two loss functions designed to either minimize the fixed-point residual or the distance to a ground truth solution. In this way, the neural network predicts warm starts with the end-to-end goal of minimizing the downstream loss. An important feature of our architecture is its flexibility, in that it can predict a warm start for fixed-point algorithms run for any number of steps, without being limited to the number of steps it has been trained on. We provide PAC-Bayes generalization bounds on unseen data for common classes of fixed-point operators: contractive, linearly convergent, and averaged. Applying this framework to well-known applications in control, statistics, and signal processing, we observe a significant reduction in the number of iterations and solution time required to solve these problems, through learned warm starts.
Nonparametric Regression Using Over-parameterized Shallow ReLU Neural Networks
http://jmlr.org/papers/v25/23-0918.html
http://jmlr.org/papers/volume25/23-0918/23-0918.pdf
2024Yunfei Yang, Ding-Xuan Zhou
It is shown that over-parameterized neural networks can achieve minimax optimal rates of convergence (up to logarithmic factors) for learning functions from certain smooth function classes, if the weights are suitably constrained or regularized. Specifically, we consider the nonparametric regression of estimating an unknown $d$-variate function by using shallow ReLU neural networks. It is assumed that the regression function is from the Hölder space with smoothness $\alpha<(d+3)/2$ or a variation space corresponding to shallow neural networks, which can be viewed as an infinitely wide neural network. In this setting, we prove that least squares estimators based on shallow neural networks with certain norm constraints on the weights are minimax optimal, if the network width is sufficiently large. As a byproduct, we derive a new size-independent bound for the local Rademacher complexity of shallow ReLU neural networks, which may be of independent interest.
Nonparametric Copula Models for Multivariate, Mixed, and Missing Data
http://jmlr.org/papers/v25/23-0495.html
http://jmlr.org/papers/volume25/23-0495/23-0495.pdf
2024Joseph Feldman, Daniel R. Kowal
Modern data sets commonly feature both substantial missingness and many variables of mixed data types, which present significant challenges for estimation and inference. Complete case analysis, which proceeds using only the observations with fully-observed variables, is often severely biased, while model-based imputation of missing values is limited by the ability of the model to capture complex dependencies among (possibly many) variables of mixed data types. To address these challenges, we develop a novel Bayesian mixture copula for joint and nonparametric modeling of multivariate count, continuous, ordinal, and unordered categorical variables, and deploy this model for inference, prediction, and imputation of missing data. Most uniquely, we introduce a new and computationally efficient strategy for marginal distribution estimation that eliminates the need to specify any marginal models yet delivers posterior consistency for each marginal distribution and the copula parameters under missingness-at-random. Extensive simulation studies demonstrate exceptional modeling and imputation capabilities relative to competing methods, especially with mixed data types, complex missingness mechanisms, and nonlinear dependencies. We conclude with a data analysis that highlights how improper treatment of missing data can distort a statistical analysis, and how the proposed approach offers a resolution.
An Analysis of Quantile Temporal-Difference Learning
http://jmlr.org/papers/v25/23-0154.html
http://jmlr.org/papers/volume25/23-0154/23-0154.pdf
2024Mark Rowland, Rémi Munos, Mohammad Gheshlaghi Azar, Yunhao Tang, Georg Ostrovski, Anna Harutyunyan, Karl Tuyls, Marc G. Bellemare, Will Dabney
We analyse quantile temporal-difference learning (QTD), a distributional reinforcement learning algorithm that has proven to be a key component in several successful large-scale applications of reinforcement learning. Despite these empirical successes, a theoretical understanding of QTD has proven elusive until now. Unlike classical TD learning, which can be analysed with standard stochastic approximation tools, QTD updates do not approximate contraction mappings, are highly non-linear, and may have multiple fixed points. The core result of this paper is a proof of convergence to the fixed points of a related family of dynamic programming procedures with probability 1, putting QTD on firm theoretical footing. The proof establishes connections between QTD and non-linear differential inclusions through stochastic approximation theory and non-smooth analysis.
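A tabular form of the QTD update can be sketched as follows, assuming the usual parameterization with $m$ quantile estimates per state at levels $\tau_i = (2i-1)/(2m)$. The two-state chain, the learning rate, and the helper name `qtd_update` are assumptions chosen for illustration.

```python
import random

def qtd_update(theta, x, reward, x_next, gamma, lr):
    """One QTD step for state x after observing (reward, x_next).

    theta maps each state to a list of m quantile estimates; x_next=None marks
    a terminal transition, whose bootstrap target is the reward alone.
    """
    m = len(theta[x])
    next_atoms = [0.0] if x_next is None else theta[x_next]
    updated = []
    for i, th in enumerate(theta[x]):
        tau = (2 * i + 1) / (2 * m)
        # Fraction of bootstrapped targets that fall below the current estimate.
        frac_below = sum(reward + gamma * nxt < th for nxt in next_atoms) / len(next_atoms)
        updated.append(th + lr * (tau - frac_below))
    theta[x] = updated

# Chain: state 0 -> state 1 (reward 0) -> terminal (reward ~ Uniform(0, 1)).
rng = random.Random(0)
m, gamma, lr = 5, 1.0, 0.002
theta = {0: [0.0] * m, 1: [0.0] * m}
for _ in range(20000):
    qtd_update(theta, 0, 0.0, 1, gamma, lr)
    qtd_update(theta, 1, rng.random(), None, gamma, lr)
taus = [(2 * i + 1) / (2 * m) for i in range(m)]
```

With an undiscounted Uniform(0, 1) return from both states, the estimates in `theta[0]` and `theta[1]` should settle near the levels in `taus`. The update uses only the sign of the TD error, which is why it is not a contraction in the usual sense.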
Conformal Inference for Online Prediction with Arbitrary Distribution Shifts
http://jmlr.org/papers/v25/22-1218.html
http://jmlr.org/papers/volume25/22-1218/22-1218.pdf
2024Isaac Gibbs, Emmanuel J. Candès
We consider the problem of forming prediction sets in an online setting where the distribution generating the data is allowed to vary over time. Previous approaches to this problem suffer from over-weighting historical data and thus may fail to quickly react to the underlying dynamics. Here, we correct this issue and develop a novel procedure with provably small regret over all local time intervals of a given width. We achieve this by modifying the adaptive conformal inference (ACI) algorithm of Gibbs and Candès (2021) to contain an additional step in which the step-size parameter of ACI's gradient descent update is tuned over time. Crucially, this means that unlike ACI, which requires knowledge of the rate of change of the data-generating mechanism, our new procedure is adaptive to both the size and type of the distribution shift. Our methods are highly flexible and can be used in combination with any baseline predictive algorithm that produces point estimates or estimated quantiles of the target without the need for distributional assumptions. We test our techniques on two real-world datasets aimed at predicting stock market volatility and COVID-19 case counts and find that they are robust and adaptive to real-world distribution shifts.
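For context, the baseline ACI recursion that this paper modifies can be sketched as below. It shows only the fixed-step update $\alpha_{t+1} = \alpha_t + \gamma(\alpha - \mathrm{err}_t)$, not the step-size tuning proposed here; the absolute-residual score, the empirical-quantile rule, the warm-up length, and the name `aci_run` are assumptions for illustration.

```python
import random

def aci_run(predictions, targets, alpha=0.1, gamma=0.02, warmup=20):
    """Baseline ACI with a fixed step size gamma and absolute-residual scores."""
    alpha_t = alpha
    scores, errors = [], []
    for pred, y in zip(predictions, targets):
        score = abs(y - pred)
        if len(scores) >= warmup:
            if alpha_t >= 1.0:
                radius = float("-inf")      # empty prediction set
            elif alpha_t <= 0.0:
                radius = float("inf")       # prediction set is the whole line
            else:
                ordered = sorted(scores)
                idx = min(int((1.0 - alpha_t) * len(ordered)), len(ordered) - 1)
                radius = ordered[idx]       # empirical quantile of past scores
            err = 1.0 if score > radius else 0.0
            errors.append(err)
            # Miscoverage lowers alpha_t (wider sets); coverage raises it.
            alpha_t += gamma * (alpha - err)
        scores.append(score)
    return errors

# Toy stream with a distribution shift: the noise scale triples halfway through.
rng = random.Random(0)
targets = [rng.gauss(0.0, 1.0) for _ in range(1000)] + [rng.gauss(0.0, 3.0) for _ in range(1000)]
predictions = [0.0] * len(targets)
errors = aci_run(predictions, targets)
miscoverage = sum(errors) / len(errors)
```

Because `alpha_t` stays bounded, the long-run value of `miscoverage` is forced close to the target `alpha` regardless of the shift. How quickly the sets react depends on `gamma`, which is the parameter the abstract proposes to tune over time.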
More Efficient Estimation of Multivariate Additive Models Based on Tensor Decomposition and Penalization
http://jmlr.org/papers/v25/22-0578.html
http://jmlr.org/papers/volume25/22-0578/22-0578.pdf
2024Xu Liu, Heng Lian, Jian Huang
We consider parsimonious modeling of high-dimensional multivariate additive models using regression splines, with or without sparsity assumptions. The approach is based on treating the coefficients in the spline expansions as a third-order tensor. Note that the data does not have tensor predictors or tensor responses, which distinguishes our study from the existing ones. A Tucker decomposition is used to reduce the number of parameters in the tensor. We also combine the Tucker decomposition with penalization to enable variable selection. The proposed method can avoid the statistical inefficiency caused by estimating a large number of nonparametric functions. We provide sufficient conditions under which the proposed tensor-based estimators achieve the optimal rate of convergence for the nonparametric regression components. We conduct simulation studies to demonstrate the effectiveness of the proposed novel approach in fitting high-dimensional multivariate additive models and illustrate its application on a breast cancer copy number variation and gene expression data set.
A Kernel Test for Causal Association via Noise Contrastive Backdoor Adjustment
http://jmlr.org/papers/v25/21-1409.html
http://jmlr.org/papers/volume25/21-1409/21-1409.pdf
2024Robert Hu, Dino Sejdinovic, Robin J. Evans
Causal inference grows increasingly complex as the dimension of confounders increases. Given treatments $X$, outcomes $Y$, and measured confounders $Z$, we develop a non-parametric method to test the do-null hypothesis that, after an intervention on $X$, there is no marginal dependence of $Y$ on $X$, against the general alternative. Building on the Hilbert-Schmidt Independence Criterion (HSIC) for marginal independence testing, we propose backdoor-HSIC (bd-HSIC), an importance weighted HSIC which combines density ratio estimation with kernel methods. Experiments on simulated data verify the correct size and that the estimator has power for both binary and continuous treatments under a large number of confounding variables. Additionally, we establish convergence properties of the estimators of covariance operators used in bd-HSIC. We investigate the advantages and disadvantages of bd-HSIC against parametric tests as well as the importance of using the do-null testing in contrast to marginal or conditional independence testing. A complete implementation can be found at https://github.com/MrHuff/kgformula.
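As background for bd-HSIC, the standard (unweighted) biased HSIC estimator can be sketched as below. This is plain marginal HSIC with Gaussian kernels, not the importance-weighted backdoor version from the paper; the bandwidth, sample size, and function names are assumptions for illustration.

```python
import numpy as np

def gaussian_gram(x, bandwidth=1.0):
    """Gaussian-kernel Gram matrix for samples stored along the first axis."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def hsic_biased(x, y, bandwidth=1.0):
    """Biased HSIC estimate trace(K H L H) / n^2 with centering matrix H."""
    n = len(x)
    K = gaussian_gram(x, bandwidth)
    L = gaussian_gram(y, bandwidth)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y_dependent = x ** 2 + 0.1 * rng.standard_normal(200)   # nonlinear dependence on x
y_independent = rng.standard_normal(200)                # independent of x
hsic_dep = hsic_biased(x, y_dependent)
hsic_ind = hsic_biased(x, y_independent)
```

The dependent pair has near-zero linear correlation, yet `hsic_dep` should be clearly larger than `hsic_ind`. The abstract's bd-HSIC additionally reweights samples using an estimated density ratio so that the test targets the do-null hypothesis rather than marginal independence.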
Assessing the Overall and Partial Causal Well-Specification of Nonlinear Additive Noise Models
http://jmlr.org/papers/v25/23-1397.html
http://jmlr.org/papers/volume25/23-1397/23-1397.pdf
2024Christoph Schultheiss, Peter Bühlmann
We propose a method to detect model misspecifications in nonlinear causal additive and potentially heteroscedastic noise models. We aim to identify predictor variables for which we can infer the causal effect even in cases of such misspecification. We develop a general framework based on knowledge of the multivariate observational data distribution. We then propose an algorithm for finite sample data, discuss its asymptotic properties, and illustrate its performance on simulated and real data.
Simple Cycle Reservoirs are Universal
http://jmlr.org/papers/v25/23-1075.html
http://jmlr.org/papers/volume25/23-1075/23-1075.pdf
2024Boyu Li, Robert Simon Fong, Peter Tino
Reservoir computation models form a subclass of recurrent neural networks with fixed non-trainable input and dynamic coupling weights. Only the static readout from the state space (reservoir) is trainable, thus avoiding the known problems with propagation of gradient information backwards through time. Reservoir models have been successfully applied in a variety of tasks and were shown to be universal approximators of time-invariant fading memory dynamic filters under various settings. Simple cycle reservoirs (SCR) have been suggested as a severely restricted reservoir architecture, with equal weight ring connectivity of the reservoir units and input-to-reservoir weights of binary nature with the same absolute value. Such architectures are well suited for hardware implementations without performance degradation in many practical tasks. In this contribution, we rigorously study the expressive power of SCR in the complex domain and show that they are capable of universal approximation of any unrestricted linear reservoir system (with continuous readout) and hence any time-invariant fading memory filter over uniformly bounded input streams.
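The SCR structure described above (a single ring with one shared weight, and input weights that differ only in sign) can be sketched with a real-valued linear reservoir and a ridge-regression readout. The reservoir size, weights, pseudo-random sign pattern, and the delayed-input task are assumptions for illustration; the paper's analysis is carried out in the complex domain.

```python
import numpy as np

def run_scr(inputs, n_units=23, ring_weight=0.8, input_weight=0.5, seed=0):
    """Linear simple cycle reservoir: x_t = r * shift(x_{t-1}) + v * u_t."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=n_units)   # assumed sign pattern
    v = input_weight * signs                        # same absolute value everywhere
    state = np.zeros(n_units)
    states = np.empty((len(inputs), n_units))
    for t, u in enumerate(inputs):
        # Ring connectivity: each unit receives only its predecessor's state.
        state = ring_weight * np.roll(state, 1) + v * u
        states[t] = state
    return states

def fit_readout(states, targets, ridge=1e-8):
    """Ridge-regression readout, the only trained part of the model."""
    gram = states.T @ states + ridge * np.eye(states.shape[1])
    return np.linalg.solve(gram, states.T @ targets)

# Memory task: reproduce the input from three steps earlier.
rng = np.random.default_rng(1)
u = rng.uniform(-1.0, 1.0, size=2000)
delay, washout, split = 3, 50, 1500
X = run_scr(u)
targets = np.roll(u, delay)          # targets[t] = u[t - delay] for t >= delay
w = fit_readout(X[washout:split], targets[washout:split])
pred = X[split:] @ w
nmse = np.mean((pred - targets[split:]) ** 2) / np.var(targets[split:])
```

Only `w` is trained; the ring weight and the input signs stay fixed, which is what makes the architecture attractive for hardware. `nmse` reports the held-out normalized error on the delay task.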
On the Computational Complexity of Metropolis-Adjusted Langevin Algorithms for Bayesian Posterior Sampling
http://jmlr.org/papers/v25/23-0783.html
http://jmlr.org/papers/volume25/23-0783/23-0783.pdf
2024Rong Tang, Yun Yang
In this paper, we examine the computational complexity of sampling from a Bayesian posterior (or pseudo-posterior) using the Metropolis-adjusted Langevin algorithm (MALA). MALA first employs a discrete-time Langevin SDE to propose a new state, and then adjusts the proposed state using Metropolis-Hastings rejection. Most existing theoretical analyses of MALA rely on the smoothness and strong log-concavity properties of the target distribution, which are often lacking in practical Bayesian problems. Our analysis hinges on statistical large sample theory, which constrains the deviation of the Bayesian posterior from being smooth and log-concave in a very specific way. In particular, we introduce a new technique for bounding the mixing time of a Markov chain with a continuous state space via the $s$-conductance profile, offering improvements over existing techniques in several aspects. By employing this new technique, we establish the optimal parameter dimension dependence of $d^{1/3}$ and condition number dependence of $\kappa$ in the non-asymptotic mixing time upper bound for MALA after the burn-in period, under a standard Bayesian setting where the target posterior distribution is close to a $d$-dimensional Gaussian distribution with a covariance matrix having a condition number $\kappa$. We also prove a matching mixing time lower bound for sampling from a multivariate Gaussian via MALA to complement the upper bound.
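The two MALA steps named in the abstract, a discretized Langevin proposal followed by a Metropolis-Hastings correction, can be sketched as follows. The Gaussian target with condition number 4, the step size, and the chain length are assumptions for illustration, not the settings analyzed in the paper.

```python
import numpy as np

def mala(log_density, grad_log_density, x0, step, n_samples, seed=0):
    """Metropolis-adjusted Langevin algorithm with a fixed step size."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_samples, x.size))
    accepted = 0

    def log_q(dst, src):
        # Log proposal density (up to a constant): N(src + step * grad, 2 * step * I).
        mean = src + step * grad_log_density(src)
        return -np.sum((dst - mean) ** 2) / (4.0 * step)

    for t in range(n_samples):
        noise = rng.standard_normal(x.size)
        proposal = x + step * grad_log_density(x) + np.sqrt(2.0 * step) * noise
        log_ratio = (log_density(proposal) + log_q(x, proposal)
                     - log_density(x) - log_q(proposal, x))
        # Metropolis-Hastings accept/reject step.
        if np.log(rng.uniform()) < log_ratio:
            x = proposal
            accepted += 1
        samples[t] = x
    return samples, accepted / n_samples

# Target: zero-mean Gaussian with covariance diag(1, 4), so kappa = 4.
variances = np.array([1.0, 4.0])
log_density = lambda x: -0.5 * np.sum(x ** 2 / variances)
grad_log_density = lambda x: -x / variances
samples, accept_rate = mala(log_density, grad_log_density, np.zeros(2), step=0.5, n_samples=20000)
sample_var = samples.var(axis=0)
```

The accept/reject step removes the discretization bias of the Langevin proposal, so `sample_var` should approach the target variances. The step size controls the trade-off between acceptance rate and move size that drives the mixing-time bounds discussed above.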
Generalization and Stability of Interpolating Neural Networks with Minimal Width
http://jmlr.org/papers/v25/23-0422.html
http://jmlr.org/papers/volume25/23-0422/23-0422.pdf
2024Hossein Taheri, Christos Thrampoulidis
We investigate the generalization and optimization properties of shallow neural-network classifiers trained by gradient descent in the interpolating regime. Specifically, in a realizable scenario where model weights can achieve arbitrarily small training error $\epsilon$ and their distance from initialization is $g(\epsilon)$, we demonstrate that gradient descent with $n$ training data achieves training error $O(g(1/T)^2\big/T)$ and generalization error $O(g(1/T)^2\big/n)$ at iteration $T$, provided there are at least $m=\Omega(g(1/T)^4)$ hidden neurons. We then show that our realizable setting encompasses a special case where data are separable by the model's neural tangent kernel. For this and logistic-loss minimization, we prove the training loss decays at a rate of $\tilde O(1/ T)$ given a polylogarithmic number of neurons $m=\Omega(\log^4 (T))$. Moreover, with $m=\Omega(\log^{4} (n))$ neurons and $T\approx n$ iterations, we bound the test loss by $\tilde{O}(1/ n)$. Our results differ from existing generalization outcomes using the algorithmic-stability framework, which necessitate polynomial width and yield suboptimal generalization rates. Central to our analysis is the use of a new self-bounded weak-convexity property, which leads to a generalized local quasi-convexity property for sufficiently parameterized neural-network classifiers. Ultimately, despite the objective's non-convexity, this leads to convergence and generalization-gap bounds that resemble those found in the convex setting of linear logistic regression.
Statistical Optimality of Divide and Conquer Kernel-based Functional Linear Regression
http://jmlr.org/papers/v25/22-1326.html
http://jmlr.org/papers/volume25/22-1326/22-1326.pdf
2024Jiading Liu, Lei Shi
Previous analysis of regularized functional linear regression in a reproducing kernel Hilbert space (RKHS) typically requires the target function to be contained in this kernel space. This paper studies the convergence performance of divide-and-conquer estimators in the scenario where the target function does not necessarily reside in the underlying RKHS. As a decomposition-based scalable approach, the divide-and-conquer estimators of functional linear regression can substantially reduce the algorithmic complexities in time and memory. We develop an integral operator approach to establish sharp finite sample upper bounds for prediction with divide-and-conquer estimators under various regularity conditions of explanatory variables and target function. We also prove the asymptotic optimality of the derived rates by building the minimax lower bounds. Finally, we consider the convergence of noiseless estimators and show that the rates can be arbitrarily fast under mild conditions.
Identifiability and Asymptotics in Learning Homogeneous Linear ODE Systems from Discrete Observations
http://jmlr.org/papers/v25/22-1159.html
http://jmlr.org/papers/volume25/22-1159/22-1159.pdf
2024Yuanyuan Wang, Wei Huang, Mingming Gong, Xi Geng, Tongliang Liu, Kun Zhang, Dacheng Tao
Ordinary Differential Equations (ODEs) have recently gained a lot of attention in machine learning. However, the theoretical aspects, for example, identifiability and asymptotic properties of statistical estimation, are still obscure. This paper derives a sufficient condition for the identifiability of homogeneous linear ODE systems from a sequence of equally-spaced error-free observations sampled from a single trajectory. When observations are disturbed by measurement noise, we prove that under mild conditions, the parameter estimator based on the Nonlinear Least Squares (NLS) method is consistent and asymptotically normal with an $n^{-1/2}$ convergence rate. Based on the asymptotic normality property, we construct confidence sets for the unknown system parameters and propose a new method to infer the causal structure of the ODE system, that is, inferring whether there is a causal link between system variables. Furthermore, we extend the results to degraded observations, including aggregated and time-scaled ones. To the best of our knowledge, our work is the first systematic study of the identifiability and asymptotic properties in learning linear ODE systems. We also construct simulations with various system dimensions to illustrate the established theoretical results.
Robust Black-Box Optimization for Stochastic Search and Episodic Reinforcement Learning
http://jmlr.org/papers/v25/22-0564.html
http://jmlr.org/papers/volume25/22-0564/22-0564.pdf
2024Maximilian Hüttenrauch, Gerhard Neumann
Black-box optimization is a versatile approach to solve complex problems where the objective function is not explicitly known and no higher order information is available. Due to its general nature, it finds widespread applications in function optimization as well as machine learning, especially episodic reinforcement learning tasks. While traditional black-box optimizers like CMA-ES may falter in noisy scenarios due to their reliance on ranking-based transformations, a promising alternative emerges in the form of the Model-based Relative Entropy Stochastic Search (MORE) algorithm. MORE can be derived from natural policy gradients and compatible function approximation and directly optimizes the expected fitness without resorting to rankings. However, in its original formulation, MORE often cannot achieve state-of-the-art performance. In this paper, we improve MORE by decoupling the update of the search distribution's mean and covariance, by introducing an improved entropy scheduling technique based on an evolution path that results in faster convergence, and by simplifying the model learning approach relative to the original paper. We show that our algorithm performs comparably to state-of-the-art black-box optimizers on standard benchmark functions. Further, it clearly outperforms ranking-based methods and other policy-gradient-based black-box algorithms as well as state-of-the-art deep reinforcement learning algorithms when used for episodic reinforcement learning tasks.
Kernel Thinning
http://jmlr.org/papers/v25/21-1334.html
http://jmlr.org/papers/volume25/21-1334/21-1334.pdf
2024Raaz Dwivedi, Lester Mackey
We introduce kernel thinning, a new procedure for compressing a distribution $\mathbb{P}$ more effectively than i.i.d. sampling or standard thinning. Given a suitable reproducing kernel $\mathbf{k}_{\star}$ and $O(n^2)$ time, kernel thinning compresses an $n$-point approximation to $\mathbb{P}$ into a $\sqrt{n}$-point approximation with comparable worst-case integration error across the associated reproducing kernel Hilbert space. The maximum discrepancy in integration error is $O_d(n^{-1/2}\sqrt{\log n})$ in probability for compactly supported $\mathbb{P}$ and $O_d(n^{-\frac{1}{2}} (\log n)^{(d+1)/2}\sqrt{\log\log n})$ for sub-exponential $\mathbb{P}$ on $\mathbb{R}^d$. In contrast, an equal-sized i.i.d. sample from $\mathbb{P}$ suffers $\Omega(n^{-1/4})$ integration error. Our sub-exponential guarantees resemble the classical quasi-Monte Carlo error rates for uniform $\mathbb{P}$ on $[0,1]^d$ but apply to general distributions on $\mathbb{R}^d$ and a wide range of common kernels. Moreover, the same construction delivers near-optimal $L^\infty$ coresets in $O(n^2)$ time. We use our results to derive explicit non-asymptotic maximum mean discrepancy bounds for Gaussian, Matérn, and B-spline kernels and present two vignettes illustrating the practical benefits of kernel thinning over i.i.d. sampling and standard Markov chain Monte Carlo thinning, in dimensions $d=2$ through $100$.
Optimal Algorithms for Stochastic Bilevel Optimization under Relaxed Smoothness Conditions
http://jmlr.org/papers/v25/23-1323.html
http://jmlr.org/papers/volume25/23-1323/23-1323.pdf
2024Xuxing Chen, Tesi Xiao, Krishnakumar Balasubramanian
We consider stochastic bilevel optimization problems involving minimizing an upper-level ($\texttt{UL}$) function that is dependent on the arg-min of a strongly-convex lower-level ($\texttt{LL}$) function. Several algorithms utilize Neumann series to approximate certain matrix inverses involved in estimating the implicit gradient of the $\texttt{UL}$ function (hypergradient). The state-of-the-art StOchastic Bilevel Algorithm ($\texttt{SOBA}$) instead uses stochastic gradient descent steps to solve the linear system associated with the explicit matrix inversion. This modification enables $\texttt{SOBA}$ to obtain a sample complexity of $\mathcal{O}(1/\epsilon^{2})$ for finding an $\epsilon$-stationary point. Unfortunately, the current analysis of $\texttt{SOBA}$ relies on the assumption of higher-order smoothness for the $\texttt{UL}$ and $\texttt{LL}$ functions to achieve optimality. In this paper, we introduce a novel fully single-loop and Hessian-inversion-free algorithmic framework for stochastic bilevel optimization and present a tighter analysis under standard smoothness assumptions (first-order Lipschitzness of the $\texttt{UL}$ function and second-order Lipschitzness of the $\texttt{LL}$ function). Furthermore, we show that a slight modification of our algorithm can handle a more general multi-objective robust bilevel optimization problem. For this case, we obtain the state-of-the-art oracle complexity results demonstrating the generality of both the proposed algorithmic and analytic frameworks. Numerical experiments demonstrate the performance gain of the proposed algorithms over existing ones.
Variational Estimators of the Degree-corrected Latent Block Model for Bipartite Networks
http://jmlr.org/papers/v25/23-0984.html
http://jmlr.org/papers/volume25/23-0984/23-0984.pdf
2024Yunpeng Zhao, Ning Hao, Ji Zhu
Bipartite graphs are ubiquitous across various scientific and engineering fields. Simultaneously grouping the two types of nodes in a bipartite graph via biclustering represents a fundamental challenge in network analysis for such graphs. The latent block model (LBM) is a commonly used model-based tool for biclustering. However, the effectiveness of the LBM is often limited by the influence of row and column sums in the data matrix. To address this limitation, we introduce the degree-corrected latent block model (DC-LBM), which accounts for the varying degrees in row and column clusters, significantly enhancing performance on real-world data sets and simulated data. We develop an efficient variational expectation-maximization algorithm by creating closed-form solutions for parameter estimates in the M steps. Furthermore, we prove the label consistency and the rate of convergence of the variational estimator under the DC-LBM, allowing the expected graph density to approach zero as long as the average expected degrees of rows and columns approach infinity when the size of the graph increases.
Statistical Inference for Fairness Auditing
http://jmlr.org/papers/v25/23-0739.html
http://jmlr.org/papers/volume25/23-0739/23-0739.pdf
2024John J. Cherian, Emmanuel J. Candès
Before deploying a black-box model in high-stakes problems, it is important to evaluate the model’s performance on sensitive subpopulations. For example, in a recidivism prediction task, we may wish to identify demographic groups for which our prediction model has unacceptably high false positive rates or certify that no such groups exist. In this paper, we frame this task, often referred to as “fairness auditing,” in terms of multiple hypothesis testing. We show how the bootstrap can be used to simultaneously bound performance disparities over a collection of groups with statistical guarantees. Our methods can be used to flag subpopulations affected by model underperformance, and certify subpopulations for which the model performs adequately. Crucially, our audit is model-agnostic and applicable to nearly any performance metric or group fairness criterion. Our methods also accommodate extremely rich (even infinite) collections of subpopulations. Further, we generalize beyond subpopulations by showing how to assess performance over certain distribution shifts. We test the proposed methods on benchmark datasets in predictive inference and algorithmic fairness and find that our audits can provide interpretable and trustworthy guarantees.
Adjusted Wasserstein Distributionally Robust Estimator in Statistical Learning
http://jmlr.org/papers/v25/23-0379.html
http://jmlr.org/papers/volume25/23-0379/23-0379.pdf
2024Yiling Xie, Xiaoming Huo
We propose an adjusted Wasserstein distributionally robust estimator, based on a nonlinear transformation of the Wasserstein distributionally robust (WDRO) estimator in statistical learning. The classic WDRO estimator is asymptotically biased, while our adjusted WDRO estimator is asymptotically unbiased, resulting in a smaller asymptotic mean squared error. Further, under certain conditions, our proposed adjustment technique provides a general principle to de-bias asymptotically biased estimators. Specifically, we will investigate how the adjusted WDRO estimator is developed in the generalized linear model, including logistic regression, linear regression, and Poisson regression. Numerical experiments demonstrate the favorable practical performance of the adjusted estimator over the classic one.
DoWhy-GCM: An Extension of DoWhy for Causal Inference in Graphical Causal Models
http://jmlr.org/papers/v25/22-1258.html
http://jmlr.org/papers/volume25/22-1258/22-1258.pdf
2024Patrick Blöbaum, Peter Götz, Kailash Budhathoki, Atalanti A. Mastakouri, Dominik Janzing
We present DoWhy-GCM, an extension of the DoWhy Python library, which leverages graphical causal models. Unlike existing causality libraries, which mainly focus on effect estimation, DoWhy-GCM addresses diverse causal queries, such as identifying the root causes of outliers and distributional changes, attributing causal influences to the data generating process of each node, or diagnosis of causal structures. With DoWhy-GCM, users typically specify cause-effect relations via a causal graph, fit causal mechanisms, and pose causal queries, all with just a few lines of code. The general documentation is available at https://www.pywhy.org/dowhy and the DoWhy-GCM specific code at https://github.com/py-why/dowhy/tree/main/dowhy/gcm.
Flexible Bayesian Product Mixture Models for Vector Autoregressions
http://jmlr.org/papers/v25/22-0717.html
http://jmlr.org/papers/volume25/22-0717/22-0717.pdf
2024Suprateek Kundu, Joshua Lukemire
Bayesian non-parametric methods based on Dirichlet process mixtures have seen tremendous success in various domains and are appealing in being able to borrow information by clustering samples that share identical parameters. However, such methods can face hurdles in heterogeneous settings where objects are expected to cluster only along a subset of axes or where clusters of samples share only a subset of identical parameters. We overcome such limitations by developing a novel class of product of Dirichlet process location-scale mixtures that enables independent clustering at multiple scales, which results in varying levels of information sharing across samples. First, we develop the approach for independent multivariate data. Subsequently, we generalize it to multivariate time-series data under the framework of multi-subject Vector Autoregressive (VAR) models, which are our primary focus and which go beyond parametric single-subject VAR models. We establish posterior consistency and develop efficient posterior computation for implementation. Extensive numerical studies involving VAR models show distinct advantages over competing methods in terms of estimation, clustering, and feature selection accuracy. Our resting state fMRI analysis from the Human Connectome Project reveals biologically interpretable connectivity differences between distinct intelligence groups, while an air pollution application illustrates the superior forecasting accuracy compared to alternative methods.
A Variational Approach to Bayesian Phylogenetic Inference
http://jmlr.org/papers/v25/22-0348.html
http://jmlr.org/papers/volume25/22-0348/22-0348.pdf
2024Cheng Zhang, Frederick A. Matsen IV
Bayesian phylogenetic inference is currently done via Markov chain Monte Carlo with simple proposal mechanisms. This hinders exploration efficiency and often requires long runs to deliver accurate posterior estimates. In this paper, we present an alternative approach: a variational framework for Bayesian phylogenetic analysis. We propose combining subsplit Bayesian networks, an expressive graphical model for tree topology distributions, and a structured amortization of the branch lengths over tree topologies for a suitable variational family of distributions. We train the variational approximation via stochastic gradient ascent and adopt gradient estimators for continuous and discrete variational parameters separately to deal with the composite latent space of phylogenetic models. We show that our variational approach provides competitive performance to MCMC, while requiring far fewer (though more costly) iterations due to a more efficient exploration mechanism enabled by variational inference. Experiments on a benchmark of challenging real data Bayesian phylogenetic inference problems demonstrate the effectiveness and efficiency of our methods.
Fat-Shattering Dimension of k-fold Aggregations
http://jmlr.org/papers/v25/21-1193.html
http://jmlr.org/papers/volume25/21-1193/21-1193.pdf
2024Idan Attias, Aryeh Kontorovich
We provide estimates on the fat-shattering dimension of aggregation rules of real-valued function classes. The latter consists of all ways of choosing k functions, one from each of the k classes, and computing pointwise an "aggregate" function of these, such as the median, mean, and maximum. The bounds are stated in terms of the fat-shattering dimensions of the component classes. For linear and affine function classes, we provide a considerably sharper upper bound and a matching lower bound, achieving, in particular, an optimal dependence on k. Along the way, we improve several known results in addition to pointing out and correcting a number of erroneous claims in the literature.
Unified Binary and Multiclass Margin-Based Classification
http://jmlr.org/papers/v25/23-1599.html
http://jmlr.org/papers/volume25/23-1599/23-1599.pdf
2024Yutong Wang, Clayton Scott
The notion of margin loss has been central to the development and analysis of algorithms for binary classification. To date, however, there remains no consensus as to the analogue of the margin loss for multiclass classification. In this work, we show that a broad range of multiclass loss functions, including many popular ones, can be expressed in the relative margin form, a generalization of the margin form of binary losses. The relative margin form is broadly useful for understanding and analyzing multiclass losses as shown by our prior work (Wang and Scott, 2020, 2021). To further demonstrate the utility of this way of expressing multiclass losses, we use it to extend the seminal result of Bartlett et al. (2006) on classification-calibration of binary margin losses to multiclass. We then analyze the class of Fenchel-Young losses, and expand the set of these losses that are known to be classification-calibrated.
Neural Feature Learning in Function Space
http://jmlr.org/papers/v25/23-1202.html
http://jmlr.org/papers/volume25/23-1202/23-1202.pdf
2024Xiangxiang Xu, Lizhong Zheng
We present a novel framework for learning system design with neural feature extractors. First, we introduce the feature geometry, which unifies statistical dependence and feature representations in a function space equipped with inner products. This connection defines function-space concepts on statistical dependence, such as norms, orthogonal projection, and spectral decomposition, exhibiting clear operational meanings. In particular, we associate each learning setting with a dependence component and formulate learning tasks as finding corresponding feature approximations. We propose a nesting technique, which provides systematic algorithm designs for learning the optimal features from data samples with off-the-shelf network architectures and optimizers. We further demonstrate multivariate learning applications, including conditional inference and multimodal learning, where we present the optimal features and reveal their connections to classical approaches.
PyGOD: A Python Library for Graph Outlier Detection
http://jmlr.org/papers/v25/23-0963.html
http://jmlr.org/papers/volume25/23-0963/23-0963.pdf
2024Kay Liu, Yingtong Dou, Xueying Ding, Xiyang Hu, Ruitong Zhang, Hao Peng, Lichao Sun, Philip S. Yu
PyGOD is an open-source Python library for detecting outliers in graph data. As the first comprehensive library of its kind, PyGOD supports a wide array of leading graph-based methods for outlier detection under an easy-to-use, well-documented API designed for use by both researchers and practitioners. PyGOD provides modularized components of the different detectors implemented so that users can easily customize each detector for their purposes. To ease the construction of detection workflows, PyGOD offers numerous commonly used utility functions. To scale computation to large graphs, PyGOD supports functionalities for deep models such as sampling and mini-batch processing. PyGOD uses best practices in fostering code reliability and maintainability, including unit testing, continuous integration, and code coverage. To facilitate accessibility, PyGOD is released under a BSD 2-Clause license at https://pygod.org and at the Python Package Index (PyPI).
Blessings and Curses of Covariate Shifts: Adversarial Learning Dynamics, Directional Convergence, and Equilibria
http://jmlr.org/papers/v25/23-0651.html
http://jmlr.org/papers/volume25/23-0651/23-0651.pdf
2024Tengyuan Liang
Covariate distribution shifts and adversarial perturbations present robustness challenges to the conventional statistical learning framework: mild shifts in the test covariate distribution can significantly affect the performance of the statistical model learned based on the training distribution. The model performance typically deteriorates when extrapolation happens: namely, covariates shift to a region where the training distribution is scarce, and naturally, the learned model has little information. For robustness and regularization considerations, adversarial perturbation techniques are proposed as a remedy; however, a careful study is needed of which extrapolation region the adversarial covariate shift will focus on, given a learned model. This paper precisely characterizes the extrapolation region, examining both regression and classification in an infinite-dimensional setting. We study the implications of adversarial covariate shifts for subsequent learning of the equilibrium---the Bayes optimal model---in a sequential game framework. We exploit the dynamics of the adversarial learning game and reveal the curious effects of the covariate shift on equilibrium learning and experimental design. In particular, we establish two directional convergence results that exhibit distinctive phenomena: (1) a blessing in regression, the adversarial covariate shifts at an exponential rate to an optimal experimental design for rapid subsequent learning; (2) a curse in classification, the adversarial covariate shifts at a subquadratic rate to the hardest experimental design trapping subsequent learning.
Fixed points of nonnegative neural networks
http://jmlr.org/papers/v25/23-0167.html
http://jmlr.org/papers/volume25/23-0167/23-0167.pdf
2024Tomasz J. Piotrowski, Renato L. G. Cavalcante, Mateusz Gabor
We use fixed point theory to analyze nonnegative neural networks, which we define as neural networks that map nonnegative vectors to nonnegative vectors. We first show that nonnegative neural networks with nonnegative weights and biases can be recognized as monotonic and (weakly) scalable mappings within the framework of nonlinear Perron-Frobenius theory. This fact enables us to provide conditions for the existence of fixed points of nonnegative neural networks having inputs and outputs of the same dimension, and these conditions are weaker than those recently obtained using arguments in convex analysis. Furthermore, we prove that the shape of the fixed point set of nonnegative neural networks with nonnegative weights and biases is an interval, which under mild conditions degenerates to a point. These results are then used to obtain the existence of fixed points of more general nonnegative neural networks. From a practical perspective, our results contribute to the understanding of the behavior of autoencoders, and we also offer valuable mathematical machinery for future developments in deep equilibrium models.
Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks
http://jmlr.org/papers/v25/22-1250.html
http://jmlr.org/papers/volume25/22-1250/22-1250.pdf
2024Fanghui Liu, Leello Dadi, Volkan Cevher
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks as the curse of dimensionality (CoD) cannot be evaded when trying to approximate even a single ReLU neuron. In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms (e.g., the path norm, the Barron norm) from the perspective of sample complexity and generalization properties. First, we show that the path norm (as well as the Barron norm) yields width-independent sample complexity bounds, which allow for uniform convergence guarantees. Based on this result, we derive an improved metric entropy result for $\epsilon$-covering up to $O(\epsilon^{-\frac{2d}{d+2}})$ ($d$ is the input dimension and the hidden constant is at most polynomial in $d$) via the convex hull technique, which demonstrates the separation from kernel methods with $\Omega(\epsilon^{-d})$ to learn the target function in a Barron space. Second, this metric entropy result allows for building a sharper generalization bound under a general moment hypothesis setting, achieving the rate $O(n^{-\frac{d+2}{2d+2}})$. Our analysis is novel in that it offers a sharper and more refined estimate of the metric entropy (with a clear dependence on the dimension $d$) and unbounded sampling in the estimation of the sample error and the output error.
A Survey on Multi-player Bandits
http://jmlr.org/papers/v25/22-0643.html
http://jmlr.org/papers/volume25/22-0643/22-0643.pdf
2024Etienne Boursier, Vianney Perchet
Due mostly to their application to cognitive radio networks, multiplayer bandits have gained a lot of interest in the last decade. Considerable progress has been made on their theoretical aspects. However, the current algorithms are far from applicable and many obstacles remain between these theoretical results and a possible implementation of multiplayer bandits algorithms in real communication networks. This survey contextualizes and organizes the rich multiplayer bandits literature. In light of the existing works, some clear directions for future research appear. We believe that a further study of these different directions might lead to theoretical algorithms adapted to real-world situations.
Transport-based Counterfactual Models
http://jmlr.org/papers/v25/21-1440.html
http://jmlr.org/papers/volume25/21-1440/21-1440.pdf
2024Lucas De Lara, Alberto González-Sanz, Nicholas Asher, Laurent Risser, Jean-Michel Loubes
Counterfactual frameworks have grown popular in machine learning for both explaining algorithmic decisions and defining individual notions of fairness that are more intuitive than typical group fairness conditions. However, state-of-the-art models to compute counterfactuals are either unrealistic or infeasible. In particular, while Pearl's causal inference provides appealing rules to calculate counterfactuals, it relies on a model that is unknown and hard to discover in practice. We address the problem of designing realistic and feasible counterfactuals in the absence of a causal model. We define transport-based counterfactual models as collections of joint probability distributions between observable distributions, and show their connection to causal counterfactuals. More specifically, we argue that optimal-transport theory defines relevant transport-based counterfactual models, as they are numerically feasible, statistically faithful, and can coincide under some assumptions with causal counterfactual models. Finally, these models make counterfactual approaches to fairness feasible, and we illustrate their practicality and efficiency on fair learning. With this paper, we aim at laying out the theoretical foundations for a new, implementable approach to counterfactual thinking.
Adaptive Latent Feature Sharing for Piecewise Linear Dimensionality Reduction
http://jmlr.org/papers/v25/21-0146.html
http://jmlr.org/papers/volume25/21-0146/21-0146.pdf
2024Adam Farooq, Yordan P. Raykov, Petar Raykov, Max A. Little
Linear Gaussian exploratory tools such as principal component analysis (PCA) and factor analysis (FA) are widely used for exploratory analysis, pre-processing, data visualization, and related tasks. Because the linear-Gaussian assumption is restrictive, for very high dimensional problems, they have been replaced by robust, sparse extensions or more flexible discrete-continuous latent feature models. Discrete-continuous latent feature models specify a dictionary of features dependent on subsets of the data and then infer the likelihood that each data point shares any of these features. This is often achieved using rich-get-richer assumptions about the feature allocation process where the dictionary tries to couple the feature frequency with the portion of total variance that it explains. In this work, we propose an alternative approach that allows for better control over the feature to data point allocation. This new approach is based on two-parameter discrete distribution models which decouple feature sparsity and dictionary size, hence capturing both common and rare features in a parsimonious way. The new framework is used to derive a novel adaptive variant of factor analysis (aFA), as well as an adaptive probabilistic principal component analysis (aPPCA) capable of flexible structure discovery and dimensionality reduction in a wide variety of scenarios. We derive both a standard Gibbs sampler and efficient expectation-maximisation inference approximations that converge orders of magnitude faster to a reasonable point estimate solution. The utility of the proposed aPPCA and aFA models is demonstrated on standard tasks such as feature learning, data visualization, and data whitening. We show that aPPCA and aFA can extract interpretable, high-level features from raw MNIST or COIL-20 images, or when applied to the analysis of autoencoder features. We also demonstrate that replacing common PCA pre-processing pipelines in the analysis of functional magnetic resonance imaging (fMRI) data with aPPCA leads to more robust and better-localised blind source separation of neural activity.
Topological Node2vec: Enhanced Graph Embedding via Persistent Homology
http://jmlr.org/papers/v25/23-1185.html
http://jmlr.org/papers/volume25/23-1185/23-1185.pdf
2024Yasuaki Hiraoka, Yusuke Imoto, Théo Lacombe, Killian Meehan, Toshiaki Yachimura
Node2vec is a graph embedding method that learns a vector representation for each node of a weighted graph while seeking to preserve relative proximity and global structure. Numerical experiments suggest Node2vec struggles to recreate the topology of the input graph. To resolve this we introduce a topological loss term to be added to the training loss of Node2vec which tries to align the persistence diagram (PD) of the resulting embedding as closely as possible to that of the input graph. Following results in computational optimal transport, we carefully adapt entropic regularization to PD metrics, allowing us to measure the discrepancy between PDs in a differentiable way. Our modified loss function can then be minimized through gradient descent to reconstruct both the geometry and the topology of the input graph. We showcase the benefits of this approach using demonstrative synthetic examples.
Granger Causal Inference in Multivariate Hawkes Processes by Minimum Message Length
http://jmlr.org/papers/v25/23-1066.html
http://jmlr.org/papers/volume25/23-1066/23-1066.pdf
2024Katerina Hlaváčková-Schindler, Anna Melnykova, Irene Tubikanec
Multivariate Hawkes processes (MHPs) are versatile probabilistic tools used to model various real-life phenomena: earthquakes, operations on stock markets, neuronal activity, virus propagation and many others. In this paper, we focus on MHPs with exponential decay kernels and estimate connectivity graphs, which represent the Granger causal relations between their components. We approach this inference problem by proposing an optimization criterion and model selection algorithm based on the minimum message length (MML) principle. MML compares Granger causal models using the Occam's razor principle in the following way: even when models have a comparable goodness-of-fit to the observed data, the one generating the most concise explanation of the data is preferred. While most state-of-the-art methods using lasso-type penalization tend to overfit in scenarios with short time horizons, the proposed MML-based method achieves high F1 scores in these settings. We conduct a numerical study comparing the proposed algorithm to other related classical and state-of-the-art methods, where we achieve the highest F1 scores in specific sparse graph settings. We also illustrate the proposed method on G7 sovereign bond data and obtain causal connections that agree with the expert knowledge available in the literature.
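For orientation, a common textbook parametrization of a $p$-dimensional Hawkes process with exponential decay kernels is written out below. The symbols $\mu_i$ (baseline intensity), $\alpha_{ij}$ (excitation weight), $\beta_{ij}$ (decay rate), and $t^{(j)}_k$ (the $k$-th event time of component $j$) are assumptions introduced here for illustration; the paper's own notation may differ.

```latex
\lambda_i(t) \;=\; \mu_i \;+\; \sum_{j=1}^{p} \; \sum_{k:\, t^{(j)}_k < t} \alpha_{ij}\, e^{-\beta_{ij}\,\bigl(t - t^{(j)}_k\bigr)}, \qquad i = 1, \dots, p.
```

Under this parametrization, component $j$ has no Granger causal influence on component $i$ when $\alpha_{ij} = 0$, so estimating the connectivity graph amounts to recovering the support of the matrix $(\alpha_{ij})$.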
Representation Learning via Manifold Flattening and Reconstruction
http://jmlr.org/papers/v25/23-0615.html
http://jmlr.org/papers/volume25/23-0615/23-0615.pdf
2024Michael Psenka, Druv Pai, Vishal Raman, Shankar Sastry, Yi Ma
A common assumption for real-world, learnable data is its possession of some low-dimensional structure, and one way to formalize this structure is through the manifold hypothesis: that learnable data lies near some low-dimensional manifold. Deep learning architectures often have a compressive autoencoder component, where data is mapped to a lower-dimensional latent space, but often many architecture design choices are made by hand, since such models do not inherently exploit the mathematical structure of the data. To utilize this geometric data structure, we propose an iterative process in the style of a geometric flow for explicitly constructing a pair of neural networks layer-wise that linearize and reconstruct an embedded submanifold, from finite samples of this manifold. The resulting neural networks, called Flattening Networks (FlatNet), are theoretically interpretable, computationally feasible at scale, and generalize well to test data, a balance not typically found in manifold-based learning methods. We present empirical results and comparisons to other models on synthetic high-dimensional manifold data and 2D image data. Our code is publicly available.
Bagging Provides Assumption-free Stability
http://jmlr.org/papers/v25/23-0536.html
http://jmlr.org/papers/volume25/23-0536/23-0536.pdf
2024Jake A. Soloff, Rina Foygel Barber, Rebecca Willett
Bagging is an important technique for stabilizing machine learning models. In this paper, we derive a finite-sample guarantee on the stability of bagging for any model. Our result places no assumptions on the distribution of the data, on the properties of the base algorithm, or on the dimensionality of the covariates. Our guarantee applies to many variants of bagging and is optimal up to a constant. Empirical results validate our findings, showing that bagging successfully stabilizes even highly unstable base algorithms.
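As a rough illustration of the idea, the sketch below bags a deliberately unstable base algorithm and measures how much a prediction moves when one training point is dropped. Everything here is an assumption made for illustration: the base learner (1-nearest-neighbour regression), the toy data, and the helper names `base_fit_predict`, `bagged_predict`, and `loo_change` are hypothetical and are not taken from the paper, and the sketch does not reproduce the paper's guarantee or experiments.

```python
import random
import statistics


def base_fit_predict(train, x):
    """Illustrative unstable base algorithm: 1-nearest-neighbour regression."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def bagged_predict(train, x, n_bags=200, seed=0):
    """Average the base prediction over bootstrap resamples of the training set."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_bags):
        # Draw len(train) points with replacement.
        resample = [rng.choice(train) for _ in train]
        preds.append(base_fit_predict(resample, x))
    return statistics.fmean(preds)


def loo_change(predict, train, x):
    """Largest change in the prediction at x when a single training point is removed."""
    full = predict(train, x)
    return max(
        abs(full - predict(train[:i] + train[i + 1:], x))
        for i in range(len(train))
    )


# Toy data (assumed): inputs on a grid, responses alternating between 0 and 1.
toy_train = [(i / 10, float(i % 2)) for i in range(10)]
print("base   leave-one-out change:", loo_change(base_fit_predict, toy_train, 0.34))
print("bagged leave-one-out change:", loo_change(bagged_predict, toy_train, 0.34))
```

The responses are kept in $[0, 1]$ so that the leave-one-out changes of the base and bagged predictors are on the same scale and can be compared directly.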
Fairness guarantees in multi-class classification with demographic parity
http://jmlr.org/papers/v25/23-0322.html
http://jmlr.org/papers/volume25/23-0322/23-0322.pdf
2024Christophe Denis, Romuald Elie, Mohamed Hebiri, François Hu
Algorithmic fairness is an established area of machine learning that aims to reduce the influence of hidden bias in the data. Yet, despite its wide range of applications, very few works consider the multi-class classification setting from the fairness perspective. We focus on this question and extend the definition of approximate fairness in the case of Demographic Parity to multi-class classification. We specify the corresponding expressions of the optimal fair classifiers in the attribute-aware case for both binary and multi-categorical sensitive attributes. This suggests a plug-in data-driven procedure, for which we establish theoretical guarantees. The enhanced estimator is proven to mimic the behavior of the optimal rule both in terms of fairness and risk. Notably, fairness guarantees are distribution-free. The approach is evaluated on both synthetic and real datasets and proves very effective in decision making with a preset level of unfairness. In addition, our method is competitive with (if not better than) the state-of-the-art in binary and multi-class tasks.
Regimes of No Gain in Multi-class Active Learning
http://jmlr.org/papers/v25/23-0234.html
http://jmlr.org/papers/volume25/23-0234/23-0234.pdf
2024Gan Yuan, Yunfan Zhao, Samory Kpotufe
We consider nonparametric classification with smooth regression functions, where it is well known that notions of margin in $\mathbb{P}(Y=y|X=x)$ determine fast or slow rates in both active and passive learning. Here we elucidate a striking distinction---most relevant in multi-class settings---between active and passive learning. Namely, we show that some seemingly benign nuances in notions of margin---involving the uniqueness of the Bayes classes, which have no apparent effect on rates in passive learning---determine whether or not any active learner can outperform passive learning rates. While a shorter conference version of this work already alluded to these nuances, it focused on the binary case and thus failed to be conclusive as to the source of difficulty in the multi-class setting: we show here that it suffices that the Bayes classifier fails to be unique, as opposed to needing all classes to be Bayes optimal, for active learning to yield no gain over passive learning.
Learning Optimal Dynamic Treatment Regimens Subject to Stagewise Risk Controls
http://jmlr.org/papers/v25/23-0072.html
http://jmlr.org/papers/volume25/23-0072/23-0072.pdf
2024Mochuan Liu, Yuanjia Wang, Haoda Fu, Donglin Zeng
Dynamic treatment regimens (DTRs) aim at tailoring individualized sequential treatment rules that maximize cumulative beneficial outcomes by accommodating patients' heterogeneity in decision-making. For many chronic diseases including type 2 diabetes mellitus (T2D), treatments are usually multifaceted in the sense that aggressive treatments with a higher expected reward are also likely to elevate the risk of acute adverse events. In this paper, we propose a new weighted learning framework, namely benefit-risk dynamic treatment regimens (BR-DTRs), to address the benefit-risk trade-off. The new framework relies on a backward learning procedure by restricting the induced risk of the treatment rule to be no larger than a pre-specified risk constraint at each treatment stage. Computationally, the estimated treatment rule solves a weighted support vector machine problem with a modified smooth constraint. Theoretically, we show that the proposed DTRs are Fisher consistent, and we further obtain the convergence rates for both the value and risk functions. Finally, the performance of the proposed method is demonstrated via extensive simulation studies and application to a real study for T2D patients.
Margin-Based Active Learning of Classifiers
http://jmlr.org/papers/v25/22-1127.html
http://jmlr.org/papers/volume25/22-1127/22-1127.pdf
2024Marco Bressan, Nicolò Cesa-Bianchi, Silvio Lattanzi, Andrea Paudice
We study active learning of multiclass classifiers, focusing on the realizable transductive setting. The input is a finite subset $X$ of some metric space, and the concept to be learned is a partition $\mathcal{C}$ of $X$ into $k$ classes. The goal is to learn $\mathcal{C}$ by querying the labels of as few elements of $X$ as possible. This is a useful subroutine in pool-based active learning, and is motivated by applications where labels are expensive to obtain. Our main result is that, in very different settings, there exist interesting notions of margin that yield efficient active learning algorithms. First, we consider the case $X \subset \mathbb{R}^m$, assuming that each class has an unknown "personalized" margin separating it from the rest. Second, we consider the case where $X$ is a finite metric space, and the classes are convex with margin according to the geodesic distances in the thresholded connectivity graph. In both cases, we give algorithms that learn $\mathcal{C}$ exactly, in polynomial time, using $\mathcal{O}(\log n)$ label queries, where $\mathcal{O}(\cdot)$ hides a near-optimal dependence on the dimension of the metric spaces. Our results actually hold for or can be adapted to more general settings, such as pseudometric and semimetric spaces.
Random Subgraph Detection Using Queries
http://jmlr.org/papers/v25/22-0395.html
http://jmlr.org/papers/volume25/22-0395/22-0395.pdf
2024Wasim Huleihel, Arya Mazumdar, Soumyabrata Pal
The planted densest subgraph detection problem refers to the task of testing whether in a given (random) graph there is a subgraph that is unusually dense. Specifically, we observe an undirected and unweighted graph on $n$ vertices. Under the null hypothesis, the graph is a realization of an Erdős-Rényi graph with edge probability (or, density) $q$. Under the alternative, there is a subgraph on $k$ vertices with edge probability $p>q$. The statistical as well as the computational barriers of this problem are well-understood for a wide range of the edge parameters $p$ and $q$. In this paper, we consider a natural variant of the above problem, where one can only observe a relatively small part of the graph using adaptive edge queries. For this model, we determine the number of queries necessary and sufficient (accompanied by a quasi-polynomial optimal algorithm) for detecting the presence of the planted subgraph. We also propose a polynomial-time algorithm which is able to detect the planted subgraph, albeit with more queries compared to the above lower bound. We conjecture that in the leftover regime, no polynomial-time algorithms exist. Our results resolve two open questions posed in the past literature.
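The two hypotheses described above can be simulated with a few lines of Python. This is a minimal sketch under stated assumptions: the function names `sample_graph` and `edge_density` and all parameter values are hypothetical, the whole graph is observed rather than accessed through adaptive edge queries, and the global edge density is only a naive baseline statistic, not the paper's algorithm.

```python
import random
from itertools import combinations


def sample_graph(n, q, k=0, p=None, seed=0):
    """Sample an undirected graph on n vertices.

    With k == 0 this is the null model: every edge appears independently
    with probability q. With k > 0, a uniformly random set of k vertices
    is planted, and edges inside that set appear with probability p instead.
    """
    rng = random.Random(seed)
    planted = set(rng.sample(range(n), k))
    edges = set()
    for u, v in combinations(range(n), 2):
        prob = p if (u in planted and v in planted) else q
        if rng.random() < prob:
            edges.add((u, v))
    return edges, planted


def edge_density(edges, n):
    """Fraction of the n-choose-2 possible edges that are present."""
    return len(edges) / (n * (n - 1) / 2)


# Illustrative parameters (assumed, not from the paper).
null_edges, _ = sample_graph(n=200, q=0.05, seed=1)
alt_edges, planted = sample_graph(n=200, q=0.05, k=40, p=0.5, seed=1)
print("null density:       ", edge_density(null_edges, 200))
print("alternative density:", edge_density(alt_edges, 200))
```

A query-based tester would inspect only a subset of the vertex pairs enumerated in `sample_graph`, choosing each next pair based on the answers seen so far.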
Classification with Deep Neural Networks and Logistic Loss
http://jmlr.org/papers/v25/22-0049.html
http://jmlr.org/papers/volume25/22-0049/22-0049.pdf
2024Zihan Zhang, Lei Shi, Ding-Xuan Zhou
Deep neural networks (DNNs) trained with the logistic loss (also known as the cross entropy loss) have made impressive advancements in various binary classification tasks. Despite the considerable success in practice, generalization analysis for binary classification with deep neural networks and the logistic loss remains scarce. The unboundedness of the target function for the logistic loss in binary classification is the main obstacle to deriving satisfactory generalization bounds. In this paper, we aim to fill this gap by developing a novel theoretical analysis and using it to establish tight generalization bounds for training fully connected ReLU DNNs with logistic loss in binary classification. Our generalization analysis is based on an elegant oracle-type inequality which enables us to deal with the boundedness restriction of the target function. Using this oracle-type inequality, we establish generalization bounds for fully connected ReLU DNN classifiers $\hat{f}^{\text{FNN}}_n$ trained by empirical logistic risk minimization with respect to i.i.d. samples of size $n$, which lead to sharp rates of convergence as $n\to\infty$. In particular, we obtain optimal convergence rates for $\hat{f}^{\text{FNN}}_n$ (up to some logarithmic factor) only requiring the Hölder smoothness of the conditional class probability $\eta$ of data. Moreover, we consider a compositional assumption that requires $\eta$ to be the composition of several vector-valued multivariate functions of which each component function is either a maximum value function or a Hölder smooth function only depending on a small number of its input variables. Under this assumption, we can even derive optimal convergence rates for $\hat{f}^{\text{FNN}}_n$ (up to some logarithmic factor) which are independent of the input dimension of data. This result explains why in practice DNN classifiers can overcome the curse of dimensionality and perform well in high-dimensional classification problems. Furthermore, we establish dimension-free rates of convergence under other circumstances such as when the decision boundary is piecewise smooth and the input data are bounded away from it. Besides the novel oracle-type inequality, the sharp convergence rates presented in our paper are also due to a tight error bound for approximating the natural logarithm function near zero (where it is unbounded) by ReLU DNNs. In addition, we justify our claims for the optimality of rates by proving corresponding minimax lower bounds. All these results are new in the literature and will deepen our theoretical understanding of classification with deep neural networks.
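To see where the unboundedness comes from, recall the standard form of the logistic loss for labels $Y \in \{-1, +1\}$ and its population minimizer. The notation below is the usual textbook one and is assumed here; the paper may set things up slightly differently.

```latex
\phi(t) = \log\bigl(1 + e^{-t}\bigr), \qquad
f^{*}(x) \;=\; \operatorname*{arg\,min}_{z \in \mathbb{R}} \; \eta(x)\,\phi(z) + \bigl(1 - \eta(x)\bigr)\,\phi(-z)
\;=\; \log\frac{\eta(x)}{1 - \eta(x)}, \qquad \eta(x) = \mathbb{P}(Y = 1 \mid X = x).
```

The target $f^{*}$ is the log-odds of $\eta$, so it diverges to $\pm\infty$ wherever $\eta(x)$ approaches $1$ or $0$; this is also why approximating the logarithm near zero enters the analysis.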
Spectral learning of multivariate extremes
http://jmlr.org/papers/v25/21-1367.html
http://jmlr.org/papers/volume25/21-1367/21-1367.pdf
2024Marco Avella Medina, Richard A Davis, Gennady Samorodnitsky
We propose a spectral clustering algorithm for analyzing the dependence structure of multivariate extremes. More specifically, we focus on the asymptotic dependence of multivariate extremes characterized by the angular or spectral measure in extreme value theory. Our work studies the theoretical performance of spectral clustering based on a random $k$-nearest neighbor graph constructed from an extremal sample, i.e., the angular part of random vectors for which the radius exceeds a large threshold. In particular, we derive the asymptotic distribution of extremes arising from a linear factor model and prove that, under certain conditions, spectral clustering can consistently identify the clusters of extremes arising in this model. Leveraging this result we propose a simple consistent estimation strategy for learning the angular measure. Our theoretical findings are complemented with numerical experiments illustrating the finite sample performance of our methods.
Sum-of-norms clustering does not separate nearby balls
http://jmlr.org/papers/v25/21-0495.html
http://jmlr.org/papers/volume25/21-0495/21-0495.pdf
2024Alexander Dunlap, Jean-Christophe Mourrat
Sum-of-norms clustering is a popular convexification of $K$-means clustering. We show that, if the dataset is made of a large number of independent random variables distributed according to the uniform measure on the union of two disjoint balls of unit radius, and if the balls are sufficiently close to one another, then sum-of-norms clustering will typically fail to recover the decomposition of the dataset into two clusters. As the dimension tends to infinity, this happens even when the distance between the centers of the two balls is taken to be as large as $2\sqrt{2}$. In order to show this, we introduce and analyze a continuous version of sum-of-norms clustering, where the dataset is replaced by a general measure. In particular, we state and prove a local-global characterization of the clustering that seems to be new even in the case of discrete datapoints.
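For orientation, sum-of-norms clustering is commonly written as the convex problem below. The normalization (the factor 1/2 and the unweighted pairwise penalty) is one standard convention and is an assumption here; the paper may scale the two terms differently.

```latex
% Data points x_1, ..., x_N in R^d; tuning parameter \lambda > 0.
\min_{y_1,\dots,y_N \in \mathbb{R}^d}\;
  \frac{1}{2}\sum_{i=1}^{N} \lVert y_i - x_i \rVert^2
  \;+\; \lambda \sum_{1 \le i < j \le N} \lVert y_i - y_j \rVert
% Points x_i and x_j are assigned to the same cluster when y_i = y_j at the minimizer.
```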
An Algorithm with Optimal Dimension-Dependence for Zero-Order Nonsmooth Nonconvex Stochastic Optimization
http://jmlr.org/papers/v25/23-1159.html
http://jmlr.org/papers/volume25/23-1159/23-1159.pdf
2024Guy Kornowski, Ohad Shamir
We study the complexity of producing $(\delta,\epsilon)$-stationary points of Lipschitz objectives which are possibly neither smooth nor convex, using only noisy function evaluations. Recent works proposed several stochastic zero-order algorithms that solve this task, all of which suffer from a dimension-dependence of $\Omega(d^{3/2})$ where $d$ is the dimension of the problem, which was conjectured to be optimal. We refute this conjecture by providing a faster algorithm that has complexity $O(d\delta^{-1}\epsilon^{-3})$, which is optimal (up to numerical constants) with respect to $d$ and also optimal with respect to the accuracy parameters $\delta,\epsilon$, thus solving an open question due to Lin et al. (2022). Moreover, the convergence rate achieved by our algorithm is also optimal for smooth objectives, proving that in the nonconvex stochastic zero-order setting, nonsmooth optimization is as easy as smooth optimization. We provide algorithms that achieve the aforementioned convergence rate in expectation as well as with high probability. Our analysis is based on a simple yet powerful lemma regarding the Goldstein-subdifferential set, which allows utilizing recent advancements in first-order nonsmooth nonconvex optimization.
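The notion of a $(\delta,\epsilon)$-stationary point used in the abstract above is usually defined through the Goldstein subdifferential. The following is the standard definition from the nonsmooth optimization literature, stated here as background rather than quoted from the paper.

```latex
% Goldstein \delta-subdifferential of a Lipschitz function f at x,
% where B_\delta(x) is the closed ball of radius \delta around x
% and \partial f is the Clarke subdifferential.
\partial_\delta f(x) := \operatorname{conv}\Bigl( \bigcup_{y \in B_\delta(x)} \partial f(y) \Bigr)

% x is a (\delta,\epsilon)-stationary point of f if
\min_{g \in \partial_\delta f(x)} \lVert g \rVert \le \epsilon
```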
Linear Distance Metric Learning with Noisy Labels
http://jmlr.org/papers/v25/23-0791.html
http://jmlr.org/papers/volume25/23-0791/23-0791.pdf
2024Meysam Alishahi, Anna Little, Jeff M. Phillips
In linear distance metric learning, we are given data in one Euclidean metric space and the goal is to find an appropriate linear map to another Euclidean metric space which respects certain distance conditions as much as possible. In this paper, we formalize a simple and elegant method which reduces to a general continuous convex loss optimization problem, and for different noise models we derive the corresponding loss functions. We show that even if the data is noisy, the ground truth linear metric can be learned with any precision provided access to enough samples, and we provide a corresponding sample complexity bound. Moreover, we present an effective way to truncate the learned model to a low-rank model that can provably maintain the accuracy in the loss function and in parameters -- the first such results of this type. Several experimental observations on synthetic and real data sets support and inform our theoretical results.
OpenBox: A Python Toolkit for Generalized Black-box Optimization
http://jmlr.org/papers/v25/23-0537.html
http://jmlr.org/papers/volume25/23-0537/23-0537.pdf
2024Huaijun Jiang, Yu Shen, Yang Li, Beicheng Xu, Sixian Du, Wentao Zhang, Ce Zhang, Bin Cui
Black-box optimization (BBO) has a broad range of applications, including automatic machine learning, experimental design, and database knob tuning. However, users still face challenges when applying BBO methods to their problems at hand with existing software packages in terms of applicability, performance, and efficiency. This paper presents OpenBox, an open-source BBO toolkit with improved usability. It implements user-friendly interfaces and visualization for users to define and manage their tasks. The modular design behind OpenBox facilitates its flexible deployment in existing systems. Experimental results demonstrate the effectiveness and efficiency of OpenBox over existing systems. The source code of OpenBox is available at https://github.com/PKU-DAIR/open-box.
Generative Adversarial Ranking Nets
http://jmlr.org/papers/v25/23-0461.html
http://jmlr.org/papers/volume25/23-0461/23-0461.pdf
2024Yinghua Yao, Yuangang Pan, Jing Li, Ivor W. Tsang, Xin Yao
We propose a new adversarial training framework -- generative adversarial ranking networks (GARNet) to learn from user preferences among a list of samples so as to generate data meeting user-specific criteria. In detail, GARNet consists of two modules: a ranker and a generator. The generator fools the ranker to raise generated samples to the top, while the ranker learns to rank generated samples at the bottom. Meanwhile, the ranker learns to rank samples according to the property of interest by training with preferences collected on real samples. The adversarial ranking game between the ranker and the generator enables an alignment between the generated data distribution and the user-preferred data distribution with theoretical guarantees and empirical verification. Specifically, we first prove that when training with full preferences on a discrete property, the learned distribution of GARNet rigorously coincides with the distribution specified by the given score vector based on user preferences. The theoretical results are then extended to partial preferences on a discrete property and further generalized to preferences on a continuous property. Meanwhile, numerous experiments show that GARNet can retrieve the distribution of user-desired data based on full/partial preferences in terms of various properties of interest (i.e., discrete/continuous property, single/multiple properties). Code is available at https://github.com/EvaFlower/GARNet.
Predictive Inference with Weak Supervision
http://jmlr.org/papers/v25/23-0253.html
http://jmlr.org/papers/volume25/23-0253/23-0253.pdf
2024Maxime Cauchois, Suyash Gupta, Alnur Ali, John C. Duchi
The expense of acquiring labels in large-scale statistical machine learning makes partially and weakly-labeled data attractive, though it is not always apparent how to leverage such data for model fitting or validation. We present a methodology to bridge the gap between partial supervision and validation, developing a conformal prediction framework to provide valid predictive confidence sets---sets that cover a true label with a prescribed probability, independent of the underlying distribution---using weakly labeled data. To do so, we introduce a (necessary) new notion of coverage and predictive validity, then develop several application scenarios, providing efficient algorithms for classification and several large-scale structured prediction problems. We corroborate the hypothesis that the new coverage definition allows for tighter and more informative (but valid) confidence sets through several experiments.
Functions with average smoothness: structure, algorithms, and learning
http://jmlr.org/papers/v25/23-0182.html
http://jmlr.org/papers/volume25/23-0182/23-0182.pdf
2024Yair Ashlagi, Lee-Ad Gottlieb, Aryeh Kontorovich
We initiate a program of average smoothness analysis for efficiently learning real-valued functions on metric spaces. Rather than using the Lipschitz constant as the regularizer, we define a local slope at each point and gauge the function complexity as the average of these values. Since the mean can be dramatically smaller than the maximum, this complexity measure can yield considerably sharper generalization bounds --- assuming that these admit a refinement where the Lipschitz constant is replaced by our average of local slopes. Our first major contribution is to obtain just such distribution-sensitive bounds. This required overcoming a number of technical challenges, perhaps the most formidable of which was bounding the empirical covering numbers, which can be much worse-behaved than the ambient ones. Our combinatorial results are accompanied by efficient algorithms for smoothing the labels of the random sample, as well as guarantees that the extension from the sample to the whole space will continue to be, with high probability, smooth on average. Along the way we discover a surprisingly rich combinatorial and analytic structure in the function class we define.
Differentially Private Data Release for Mixed-type Data via Latent Factor Models
http://jmlr.org/papers/v25/22-1324.html
http://jmlr.org/papers/volume25/22-1324/22-1324.pdf
2024Yanqing Zhang, Qi Xu, Niansheng Tang, Annie Qu
Differential privacy is a particular data privacy-preserving technology which enables synthetic data or statistical analysis results to be released with a minimum disclosure of private information from individual records. The tradeoff between privacy-preserving and utility guarantee is always a challenge for differential privacy technology, especially for synthetic data generation. In this paper, we propose a differentially private data synthesis algorithm for mixed-type data with correlation based on latent factor models. The proposed method can add a relatively small amount of noise to synthetic data under a given level of privacy protection while capturing correlation information. Moreover, the proposed algorithm can generate synthetic data preserving the same data type as mixed-type original data, which greatly improves the utility of synthetic data. The key idea of our method is to perturb the factor matrix and factor loading matrix to construct a synthetic data generation model, and to utilize link functions with privacy protection to ensure consistency of synthetic data type with original data. The proposed method can generate privacy-preserving synthetic data at low computation cost even when the original data is high-dimensional. In theory, we establish differentially private properties of the proposed method. Our numerical studies also demonstrate superb performance of the proposed method on the utility guarantee of the statistical analysis based on privacy-preserved synthetic data.
The Non-Overlapping Statistical Approximation to Overlapping Group Lasso
http://jmlr.org/papers/v25/22-1105.html
http://jmlr.org/papers/volume25/22-1105/22-1105.pdf
2024Mingyu Qi, Tianxi Li
The group lasso penalty is widely used to introduce structured sparsity in statistical learning, characterized by its ability to eliminate predefined groups of parameters automatically. However, when the groups overlap, solving the group lasso problem can be time-consuming in high-dimensional settings due to groups’ non-separability. This computational challenge has limited the applicability of the overlapping group lasso penalty in cutting-edge areas, such as gene pathway selection and graphical model estimation. This paper introduces a non-overlapping and separable penalty designed to efficiently approximate the overlapping group lasso penalty. The approximation substantially enhances the computational efficiency in optimization, especially for large-scale and high-dimensional problems. We show that the proposed penalty is the tightest separable relaxation of the overlapping group lasso norm within the family of $\ell_{q_1}/\ell_{q_2}$ norms. Moreover, the estimators derived from our proposed norm are statistically equivalent to those derived from the overlapping group lasso penalty in terms of estimation error, support recovery, and minimax rate under the squared loss. The effectiveness of our method is demonstrated through extensive simulation examples and a predictive task of cancer tumors.
Faster Rates of Differentially Private Stochastic Convex Optimization
http://jmlr.org/papers/v25/22-0079.html
http://jmlr.org/papers/volume25/22-0079/22-0079.pdf
2024Jinyan Su, Lijie Hu, Di Wang
In this paper, we revisit the problem of Differentially Private Stochastic Convex Optimization (DP-SCO) and provide excess population risks for some special classes of functions that are faster than the previous results of general convex and strongly convex functions. In the first part of the paper, we study the case where the population risk function satisfies the Tsybakov Noise Condition (TNC) with some parameter $\theta>1$. Specifically, we first show that under some mild assumptions on the loss functions, there is an algorithm whose output could achieve an upper bound of $\tilde{O}((\frac{1}{\sqrt{n}}+\frac{d}{n\epsilon})^\frac{\theta}{\theta-1}) $ and $\tilde{O}((\frac{1}{\sqrt{n}}+\frac{\sqrt{d\log(1/\delta)}}{n\epsilon})^\frac{\theta}{\theta-1})$ for $\epsilon$-DP and $(\epsilon, \delta)$-DP, respectively when $\theta\geq 2$, where $n$ is the sample size and $d$ is the dimension of the space. Then we address the inefficiency issue, improve the upper bounds by $\text{Poly}(\log n)$ factors and extend to the case where $\theta\geq \bar{\theta}>1$ for some known $\bar{\theta}$. Next, we show that the excess population risk of population functions satisfying TNC with parameter $\theta\geq 2$ is always lower bounded by $\Omega((\frac{d}{n\epsilon})^\frac{\theta}{\theta-1}) $ and $\Omega((\frac{\sqrt{d\log(1/\delta)}}{n\epsilon})^\frac{\theta}{\theta-1})$ for $\epsilon$-DP and $(\epsilon, \delta)$-DP, respectively, which matches our upper bounds. In the second part, we focus on a special case where the population risk function is strongly convex. Unlike the previous studies, here we assume the loss function is non-negative and the optimal value of population risk is sufficiently small.
With these additional assumptions, we propose a new method whose output could achieve an upper bound of $O(\frac{d\log(1/\delta)}{n^2\epsilon^2}+\frac{1}{n^{\tau}})$ and $O(\frac{d^2}{n^2\epsilon^2}+\frac{1}{n^{\tau}})$ for any $\tau> 1$ in $(\epsilon,\delta)$-DP and $\epsilon$-DP model respectively if the sample size $n$ is sufficiently large. These results circumvent their corresponding lower bounds in (Feldman et al., 2020) for general strongly convex functions. Finally, we conduct experiments of our new methods on real-world data. Experimental results also provide new insights into established theories.
Nonasymptotic analysis of Stochastic Gradient Hamiltonian Monte Carlo under local conditions for nonconvex optimization
http://jmlr.org/papers/v25/21-1423.html
http://jmlr.org/papers/volume25/21-1423/21-1423.pdf
2024O. Deniz Akyildiz, Sotirios Sabanis
We provide a nonasymptotic analysis of the convergence of the stochastic gradient Hamiltonian Monte Carlo (SGHMC) to a target measure in Wasserstein-2 distance without assuming log-concavity. Our analysis quantifies key theoretical properties of the SGHMC as a sampler under local conditions which significantly improves the findings of previous results. In particular, we prove that the Wasserstein-2 distance between the target and the law of the SGHMC is uniformly controlled by the step-size of the algorithm, therefore demonstrate that the SGHMC can provide high-precision results uniformly in the number of iterations. The analysis also allows us to obtain nonasymptotic bounds for nonconvex optimization problems under local conditions and implies that the SGHMC, when viewed as a nonconvex optimizer, converges to a global minimum with the best known rates. We apply our results to obtain nonasymptotic bounds for scalable Bayesian inference and nonasymptotic generalization bounds.
Finite-time Analysis of Globally Nonstationary Multi-Armed Bandits
http://jmlr.org/papers/v25/21-0916.html
http://jmlr.org/papers/volume25/21-0916/21-0916.pdf
2024Junpei Komiyama, Edouard Fouché, Junya Honda
We consider nonstationary multi-armed bandit problems where the model parameters of the arms change over time. We introduce the adaptive resetting bandit (ADR-bandit), a bandit algorithm class that leverages adaptive windowing techniques from literature on data streams. We first provide new guarantees on the quality of estimators resulting from adaptive windowing techniques, which are of independent interest. Furthermore, we conduct a finite-time analysis of ADR-bandit in two typical environments: an abrupt environment where changes occur instantaneously and a gradual environment where changes occur progressively. We demonstrate that ADR-bandit has nearly optimal performance when abrupt or gradual changes occur in a coordinated manner that we call global changes. We demonstrate that forced exploration is unnecessary when we assume such global changes. Unlike the existing nonstationary bandit algorithms, ADR-bandit has optimal performance in stationary environments as well as nonstationary environments with global changes. Our experiments show that the proposed algorithms outperform the existing approaches in synthetic and real-world environments.
Stable Implementation of Probabilistic ODE Solvers
http://jmlr.org/papers/v25/20-1423.html
http://jmlr.org/papers/volume25/20-1423/20-1423.pdf
2024Nicholas Krämer, Philipp Hennig
Probabilistic solvers for ordinary differential equations (ODEs) provide efficient quantification of numerical uncertainty associated with the simulation of dynamical systems. Their convergence rates have been established by a growing body of theoretical analysis. However, these algorithms suffer from numerical instability when run at high order or with small step sizes---that is, exactly in the regime in which they achieve the highest accuracy. The present work proposes and examines a solution to this problem. It involves three components: accurate initialisation, a coordinate change preconditioner that makes numerical stability concerns step-size-independent, and square-root implementation. Using all three techniques enables numerical computation of probabilistic solutions of ODEs with algorithms of order up to 11, as demonstrated on a set of challenging test problems. The resulting rapid convergence is shown to be competitive with high-order, state-of-the-art, classical methods. As a consequence, a barrier between analysing probabilistic ODE solvers and applying them to interesting machine learning problems is effectively removed.
More PAC-Bayes bounds: From bounded losses, to losses with general tail behaviors, to anytime validity
http://jmlr.org/papers/v25/23-1360.html
http://jmlr.org/papers/volume25/23-1360/23-1360.pdf
2024Borja Rodríguez-Gálvez, Ragnar Thobaben, Mikael Skoglund
In this paper, we present new high-probability PAC-Bayes bounds for different types of losses. Firstly, for losses with a bounded range, we recover a strengthened version of Catoni's bound that holds uniformly for all parameter values. This leads to new fast-rate and mixed-rate bounds that are interpretable and tighter than previous bounds in the literature. In particular, the fast-rate bound is equivalent to the Seeger--Langford bound. Secondly, for losses with more general tail behaviors, we introduce two new parameter-free bounds: a PAC-Bayes Chernoff analogue when the loss' cumulative generating function is bounded, and a bound when the loss' second moment is bounded. These two bounds are obtained using a new technique based on a discretization of the space of possible events for the "in probability" parameter optimization problem. This technique is both simpler and more general than previous approaches optimizing over a grid on the parameters' space. Finally, using a simple technique that is applicable to any existing bound, we extend all previous results to anytime-valid bounds.
Neural Hilbert Ladders: Multi-Layer Neural Networks in Function Space
http://jmlr.org/papers/v25/23-1225.html
http://jmlr.org/papers/volume25/23-1225/23-1225.pdf
2024Zhengdao Chen
To characterize the function space explored by neural networks (NNs) is an important aspect of learning theory. In this work, noticing that a multi-layer NN generates implicitly a hierarchy of reproducing kernel Hilbert spaces (RKHSs) -- named a neural Hilbert ladder (NHL) -- we define the function space as an infinite union of RKHSs, which generalizes the existing Barron space theory of two-layer NNs. We then establish several theoretical properties of the new space. First, we prove a correspondence between functions expressed by L-layer NNs and those belonging to L-level NHLs. Second, we prove generalization guarantees for learning an NHL with a controlled complexity measure. Third, we derive a non-Markovian dynamics of random fields that governs the evolution of the NHL which is induced by the training of multi-layer NNs in an infinite-width mean-field limit. Fourth, we show examples of depth separation in NHLs under the ReLU activation function. Finally, we perform numerical experiments to illustrate the feature learning aspect of NN training through the lens of NHLs.
QDax: A Library for Quality-Diversity and Population-based Algorithms with Hardware Acceleration
http://jmlr.org/papers/v25/23-1027.html
http://jmlr.org/papers/volume25/23-1027/23-1027.pdf
2024Felix Chalumeau, Bryan Lim, Raphaël Boige, Maxime Allard, Luca Grillotti, Manon Flageat, Valentin Macé, Guillaume Richard, Arthur Flajolet, Thomas Pierrot, Antoine Cully
QDax is an open-source library with a streamlined and modular API for Quality-Diversity (QD) optimisation algorithms in Jax. The library serves as a versatile tool for optimisation purposes, ranging from black-box optimisation to continuous control. QDax offers implementations of popular QD, Neuroevolution, and Reinforcement Learning (RL) algorithms, supported by various examples. All the implementations can be just-in-time compiled with Jax, facilitating efficient execution across multiple accelerators, including GPUs and TPUs. These implementations effectively demonstrate the framework's flexibility and user-friendliness, easing experimentation for research purposes. Furthermore, the library is thoroughly documented and has 93% test coverage.
Random Forest Weighted Local Fréchet Regression with Random Objects
http://jmlr.org/papers/v25/23-0811.html
http://jmlr.org/papers/volume25/23-0811/23-0811.pdf
2024Rui Qiu, Zhou Yu, Ruoqing Zhu
Statistical analysis is increasingly confronted with complex data from metric spaces. Petersen and Müller (2019) established a general paradigm of Fréchet regression with complex metric space valued responses and Euclidean predictors. However, the local approach therein involves nonparametric kernel smoothing and suffers from the curse of dimensionality. To address this issue, we in this paper propose a novel random forest weighted local Fréchet regression paradigm. The main mechanism of our approach relies on a locally adaptive kernel generated by random forests. Our first method uses these weights as the local average to solve the conditional Fréchet mean, while the second method performs local linear Fréchet regression, both significantly improving existing Fréchet regression methods. Based on the theory of infinite order U-processes and infinite order $M_{m_n}$-estimator, we establish the consistency, rate of convergence, and asymptotic normality for our local constant estimator, which covers the current large sample theory of random forests with Euclidean responses as a special case. Numerical studies show the superiority of our methods with several commonly encountered types of responses such as distribution functions, symmetric positive-definite matrices, and sphere data. The practical merits of our proposals are also demonstrated through the application to New York taxi data and human mortality data.
PhAST: Physics-Aware, Scalable, and Task-Specific GNNs for Accelerated Catalyst Design
http://jmlr.org/papers/v25/23-0680.html
http://jmlr.org/papers/volume25/23-0680/23-0680.pdf
2024Alexandre Duval, Victor Schmidt, Santiago Miret, Yoshua Bengio, Alex Hernández-García, David Rolnick
Mitigating the climate crisis requires a rapid transition towards lower-carbon energy. Catalyst materials play a crucial role in the electrochemical reactions involved in numerous industrial processes key to this transition, such as renewable energy storage and electrofuel synthesis. To reduce the energy spent on such activities, we must quickly discover more efficient catalysts to drive electrochemical reactions. Machine learning (ML) holds the potential to efficiently model materials properties from large amounts of data, accelerating electrocatalyst design. The Open Catalyst Project OC20 dataset was constructed to that end. However, ML models trained on OC20 are still neither scalable nor accurate enough for practical applications. In this paper, we propose task-specific innovations applicable to most architectures, enhancing both computational efficiency and accuracy. This includes improvements in (1) the graph creation step, (2) atom representations, (3) the energy prediction head, and (4) the force prediction head. We describe these contributions, referred to as PhAST, and evaluate them thoroughly on multiple architectures. Overall, PhAST improves energy MAE by 4 to 42% while dividing compute time by 3 to 8× depending on the targeted task/model. PhAST also enables CPU training, leading to 40× speedups in highly parallelized settings. Python package: https://phast.readthedocs.io.
Unsupervised Anomaly Detection Algorithms on Real-world Data: How Many Do We Need?
http://jmlr.org/papers/v25/23-0570.html
http://jmlr.org/papers/volume25/23-0570/23-0570.pdf
2024Roel Bouman, Zaharah Bukhsh, Tom Heskes
In this study we evaluate 33 unsupervised anomaly detection algorithms on 52 real-world multivariate tabular data sets, performing the largest comparison of unsupervised anomaly detection algorithms to date. On this collection of data sets, the EIF (Extended Isolation Forest) algorithm significantly outperforms most other algorithms. Visualizing and then clustering the relative performance of the considered algorithms on all data sets, we identify two clear clusters: one with "local" data sets, and another with "global" data sets. "Local" anomalies occupy a region with low density when compared to nearby samples, while "global" anomalies occupy an overall low density region in the feature space. On the local data sets the $k$NN ($k$-nearest neighbor) algorithm comes out on top. On the global data sets, the EIF algorithm performs the best. Also taking into consideration the algorithms' computational complexity, a toolbox with these two unsupervised anomaly detection algorithms suffices for finding anomalies in this representative collection of multivariate data sets. By providing access to code and data sets, our study can be easily reproduced and extended with more algorithms and/or data sets.
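To make the $k$NN detector mentioned above concrete, here is a minimal Python sketch of one common variant, which scores each sample by the distance to its $k$-th nearest neighbor. The function name `knn_anomaly_scores`, the Euclidean metric, and the toy data are assumptions for illustration; the study's own implementation and settings may differ.

```python
import numpy as np

def knn_anomaly_scores(X, k):
    """Score each row of X by the Euclidean distance to its k-th nearest other row.

    Requires 1 <= k < number of rows. Larger scores suggest more anomalous points.
    """
    X = np.asarray(X, dtype=float)
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 holds the zero distance of a point to itself,
    # so column k holds the distance to the k-th nearest other point.
    sorted_dists = np.sort(dists, axis=1)
    return sorted_dists[:, k]

# Hypothetical toy data: three nearby points and one far-away point.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]])
scores = knn_anomaly_scores(points, k=2)
```

The full pairwise distance matrix keeps the sketch short but uses memory quadratic in the number of samples, so a tree-based neighbor search would be preferable on larger data sets.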
Multi-class Probabilistic Bounds for Majority Vote Classifiers with Partially Labeled Data
http://jmlr.org/papers/v25/23-0121.html
http://jmlr.org/papers/volume25/23-0121/23-0121.pdf
2024Vasilii Feofanov, Emilie Devijver, Massih-Reza Amini
In this paper, we propose a probabilistic framework for analyzing a multi-class majority vote classifier in the case where training data is partially labeled. First, we derive a multi-class transductive bound over the risk of the majority vote classifier, which is based on the classifier's vote distribution over each class. Then, we introduce a mislabeling error model to analyze the error of the majority vote classifier in the case of the pseudo-labeled training data. We derive a generalization bound over the majority vote error when imperfect labels are given, taking into account the mean and the variance of the prediction margin. Finally, we demonstrate an application of the derived transductive bound for self-training to find automatically the confidence threshold used to determine unlabeled examples for pseudo-labeling. Empirical results on different data sets show the effectiveness of our framework compared to several state-of-the-art semi-supervised approaches.
Information Processing Equalities and the Information–Risk Bridge
http://jmlr.org/papers/v25/22-0988.html
http://jmlr.org/papers/volume25/22-0988/22-0988.pdf
2024Robert C. Williamson, Zac Cranko
We introduce two new classes of measures of information for statistical experiments which generalise and subsume φ-divergences, integral probability metrics, N-distances (MMD), and (f,Γ) divergences between two or more distributions. This enables us to derive a simple geometrical relationship between measures of information and the Bayes risk of a statistical decision problem, thus extending the variational φ-divergence representation to multiple distributions in an entirely symmetric manner. The new families of divergence are closed under the action of Markov operators which yields an information processing equality which is a refinement and generalisation of the classical information processing inequality. This equality gives insight into the significance of the choice of the hypothesis class in classical risk minimization.
Nonparametric Regression for 3D Point Cloud Learning
http://jmlr.org/papers/v25/22-0735.html
http://jmlr.org/papers/volume25/22-0735/22-0735.pdf
2024Xinyi Li, Shan Yu, Yueying Wang, Guannan Wang, Li Wang, Ming-Jun Lai
In recent years, there has been an exponentially increased amount of point clouds collected with irregular shapes in various areas. Motivated by the importance of solid modeling for point clouds, we develop a novel and efficient smoothing tool based on multivariate splines over the triangulation to extract the underlying signal and build up a 3D solid model from the point cloud. The proposed method can denoise or deblur the point cloud effectively, provide a multi-resolution reconstruction of the actual signal, and handle sparse and irregularly distributed point clouds to recover the underlying trajectory. In addition, our method provides a natural way of numerosity data reduction. We establish the theoretical guarantees of the proposed method, including the convergence rate and asymptotic normality of the estimator, and show that the convergence rate achieves optimal nonparametric convergence. We also introduce a bootstrap method to quantify the uncertainty of the estimators. Through extensive simulation studies and a real data example, we demonstrate the superiority of the proposed method over traditional smoothing methods in terms of estimation accuracy and efficiency of data reduction.
AMLB: an AutoML Benchmark
http://jmlr.org/papers/v25/22-0493.html
http://jmlr.org/papers/volume25/22-0493/22-0493.pdf
2024Pieter Gijsbers, Marcos L. P. Bueno, Stefan Coors, Erin LeDell, Sébastien Poirier, Janek Thomas, Bernd Bischl, Joaquin Vanschoren
Comparing different AutoML frameworks is notoriously challenging and often done incorrectly. We introduce an open and extensible benchmark that follows best practices and avoids common mistakes when comparing AutoML frameworks. We conduct a thorough comparison of 9 well-known AutoML frameworks across 71 classification and 33 regression tasks. The differences between the AutoML frameworks are explored with a multi-faceted analysis, evaluating model accuracy, its trade-offs with inference time, and framework failures. We also use Bradley-Terry trees to discover subsets of tasks where the relative AutoML framework rankings differ. The benchmark comes with an open-source tool that integrates with many AutoML frameworks and automates the empirical evaluation process end-to-end: from framework installation and resource allocation to in-depth evaluation. The benchmark uses public data sets, can be easily extended with other AutoML frameworks and tasks, and has a website with up-to-date results.
Materials Discovery using Max K-Armed Bandit
http://jmlr.org/papers/v25/22-0186.html
http://jmlr.org/papers/volume25/22-0186/22-0186.pdf
2024Nobuaki Kikkawa, Hiroshi Ohno
Search algorithms for bandit problems are applicable in materials discovery. However, the objectives of the conventional bandit problem differ from those of materials discovery. The conventional bandit problem aims to maximize the total reward, whereas materials discovery aims to achieve breakthroughs in material properties. The max $K$-armed bandit (MKB) problem, which aims to acquire the single best reward, matches discovery tasks better than the conventional bandit. However, typical MKB algorithms are not directly applicable to materials discovery: they have many hyperparameters and are difficult to implement directly for materials discovery. Thus, we propose a new MKB algorithm using an upper confidence bound of expected improvement of the best reward. This approach is guaranteed to be asymptotic to greedy oracles, which does not depend on the time horizon. In addition, compared with other MKB algorithms, the proposed algorithm has only one hyperparameter, which is advantageous in materials discovery. We applied the proposed algorithm to synthetic problems and molecular-design demonstrations using a Monte Carlo tree search. According to the results, the proposed algorithm stably outperformed other bandit algorithms in the late stage of the search process, unless the optimal arm coincides in the MKB and conventional bandit settings.
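The difference between the two objectives described in this abstract can be illustrated with a minimal sketch. The reward sequences and function names below are hypothetical and chosen only for illustration; this is not the algorithm proposed in the paper.

```python
# Hypothetical illustration of the two objectives; not the paper's algorithm.

def cumulative_reward(rewards):
    """Conventional bandit objective: total reward collected."""
    return sum(rewards)

def best_single_reward(rewards):
    """Max K-armed bandit (MKB) objective: the single best reward observed."""
    return max(rewards)

# Made-up reward histories for two arms (assumed values).
steady_arm = [0.5, 0.5, 0.5, 0.5]      # reliable, moderate rewards
heavy_tail_arm = [0.0, 0.0, 0.0, 1.5]  # mostly poor, one breakthrough

# The conventional objective prefers the steady arm,
# while the MKB objective prefers the arm with the rare breakthrough.
prefers_steady = cumulative_reward(steady_arm) > cumulative_reward(heavy_tail_arm)
prefers_heavy_tail = best_single_reward(heavy_tail_arm) > best_single_reward(steady_arm)
```

Under these assumed rewards, the two objectives rank the arms in opposite order, which is the mismatch the abstract points to.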
Semi-supervised Inference for Block-wise Missing Data without Imputation
http://jmlr.org/papers/v25/21-1504.html
http://jmlr.org/papers/volume25/21-1504/21-1504.pdf
2024Shanshan Song, Yuanyuan Lin, Yong Zhou
We consider statistical inference for single or low-dimensional parameters in a high-dimensional linear model under a semi-supervised setting, wherein the data are a combination of a labelled block-wise missing data set of a relatively small size and a large unlabelled data set. The proposed method utilises both labelled and unlabelled data without any imputation or removal of the missing observations. The asymptotic properties of the estimator are established under regularity conditions. Hypothesis testing for low-dimensional coefficients is also studied. Extensive simulations are conducted to examine the theoretical results. The method is evaluated on the Alzheimer’s Disease Neuroimaging Initiative data.
Adaptivity and Non-stationarity: Problem-dependent Dynamic Regret for Online Convex Optimization
http://jmlr.org/papers/v25/21-0748.html
http://jmlr.org/papers/volume25/21-0748/21-0748.pdf
2024Peng Zhao, Yu-Jie Zhang, Lijun Zhang, Zhi-Hua Zhou
We investigate online convex optimization in non-stationary environments and choose dynamic regret as the performance measure, defined as the difference between cumulative loss incurred by the online algorithm and that of any feasible comparator sequence. With $T$ denoting the time horizon and $P_T$ the path length that essentially reflects the non-stationarity of environments, the state-of-the-art dynamic regret is $\mathcal{O}(\sqrt{T(1+P_T)})$. Although this bound is proved to be minimax optimal for convex functions, in this paper, we demonstrate that it is possible to further enhance the guarantee for some easy problem instances, particularly when online functions are smooth. Specifically, we introduce novel online algorithms that can exploit smoothness and replace the dependence on $T$ in dynamic regret with problem-dependent quantities: the variation in gradients of loss functions, the cumulative loss of the comparator sequence, and the minimum of these two terms. These quantities are at most $\mathcal{O}(T)$ but can be much smaller in benign environments. Therefore, our results are adaptive to the intrinsic difficulty of the problem, since the bounds are tighter than existing results for easy problems while safeguarding the same rate in the worst case. Notably, our proposed algorithms can achieve favorable dynamic regret with only one gradient per iteration, sharing the same gradient query complexity as the static regret minimization methods. To accomplish this, we introduce the collaborative online ensemble framework. The proposed framework employs a two-layer online ensemble to handle non-stationarity, uses optimistic online learning, and further introduces crucial correction terms to enable effective collaboration between the meta and base layers, thereby attaining adaptivity. We believe the framework can be useful for broader problems.
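For reference, the two quantities named in this abstract are commonly written as follows. The symbols $f_t$ (loss at round $t$), $x_t$ (the learner's decision), and $u_t$ (the comparator), as well as the choice of the Euclidean norm, are notation assumed here and are not stated in the abstract itself.

$$
\text{D-Regret}_T(u_1,\dots,u_T) = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t),
\qquad
P_T = \sum_{t=2}^{T} \lVert u_t - u_{t-1} \rVert_2 .
$$

When the comparator is fixed ($u_1 = \dots = u_T$), $P_T = 0$ and the bound $\mathcal{O}(\sqrt{T(1+P_T)})$ reduces to the familiar $\mathcal{O}(\sqrt{T})$ static-regret rate.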
Scaling Speech Technology to 1,000+ Languages
http://jmlr.org/papers/v25/23-1318.html
http://jmlr.org/papers/volume25/23-1318/23-1318.pdf
2024Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli
Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages, which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task, while providing improved accuracy compared to prior work. The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.
MAP- and MLE-Based Teaching
http://jmlr.org/papers/v25/23-1086.html
http://jmlr.org/papers/volume25/23-1086/23-1086.pdf
2024Hans Ulrich Simon, Jan Arne Telle
Imagine a learner $L$ who tries to infer a hidden concept from a collection of observations. Building on the work of Ferri et al., we assume the learner to be parameterized by priors $P(c)$ and by $c$-conditional likelihoods $P(z|c)$ where $c$ ranges over all concepts in a given class $C$ and $z$ ranges over all observations in an observation set $Z$. $L$ is called a MAP-learner (resp. an MLE-learner) if it thinks of a collection $S$ of observations as a random sample and returns the concept with the maximum a-posteriori probability (resp. the concept which maximizes the $c$-conditional likelihood of $S$). Depending on whether $L$ assumes that $S$ is obtained from ordered or unordered sampling, and from sampling with or without replacement, we can distinguish four different sampling modes. Given a target concept $c^* \in C$, a teacher for a MAP-learner $L$ aims at finding a smallest collection of observations that causes $L$ to return $c^*$. This approach leads in a natural manner to various notions of a MAP- or MLE-teaching dimension of a concept class $C$. Our main results are as follows. First, we show that this teaching model has some desirable monotonicity properties. Second, we clarify how the four sampling modes are related to each other. As for the (important!) special case, where concepts are subsets of a domain and observations are 0,1-labeled examples, we obtain some additional results. First of all, we characterize the MAP- and MLE-teaching dimension associated with an optimally parameterized MAP-learner graph-theoretically. From this central result, some other ones are easy to derive. It is shown, for instance, that the MLE-teaching dimension is either equal to the MAP-teaching dimension or exceeds the latter by $1$. It is shown furthermore that these dimensions can be bounded from above by the so-called antichain number, the VC-dimension and related combinatorial parameters. Moreover, they can be computed in polynomial time.
A General Framework for the Analysis of Kernel-based Tests
http://jmlr.org/papers/v25/23-0985.html
http://jmlr.org/papers/volume25/23-0985/23-0985.pdf
2024Tamara Fernández, Nicolás Rivera
Kernel-based tests provide a simple yet effective framework that uses the theory of reproducing kernel Hilbert spaces to design non-parametric testing procedures. In this paper, we propose new theoretical tools that can be used to study the asymptotic behaviour of kernel-based tests in various data scenarios and in different testing problems. Unlike current approaches, our methods avoid working with U and V-statistics expansions that usually lead to lengthy and tedious computations and asymptotic approximations. Instead, we work directly with random functionals on the Hilbert space to analyse kernel-based tests. By harnessing the use of random functionals, our framework leads to much cleaner analyses, involving less tedious computations. Additionally, it offers the advantage of accommodating pre-existing knowledge regarding test-statistics as many of the random functionals considered in applications are known statistics that have been studied comprehensively. To demonstrate the efficacy of our approach, we thoroughly examine two categories of kernel tests, along with three specific examples of kernel tests, including a novel kernel test for conditional independence testing.
Overparametrized Multi-layer Neural Networks: Uniform Concentration of Neural Tangent Kernel and Convergence of Stochastic Gradient Descent
http://jmlr.org/papers/v25/23-0740.html
http://jmlr.org/papers/volume25/23-0740/23-0740.pdf
2024Jiaming Xu, Hanjing Zhu
There has been exciting progress in understanding the convergence of gradient descent (GD) and stochastic gradient descent (SGD) in overparameterized neural networks through the lens of the neural tangent kernel (NTK). However, there remain two significant gaps between theory and practice. First, the existing convergence theory only takes into account the contribution of the NTK from the last hidden layer, while in practice the intermediate layers also play an instrumental role. Second, most existing works assume that the training data are provided a priori in a batch, while less attention has been paid to the important setting where the training data arrive in a stream. In this paper, we close these two gaps. We first show that with random initialization, the NTK function converges to some deterministic function uniformly for all layers as the number of neurons tends to infinity. Then we apply the uniform convergence result to further prove that the prediction error of multi-layer neural networks under SGD converges in expectation in the streaming data setting. A key ingredient in our proof is to show that the number of activation patterns of an $L$-layer neural network with width $m$ is only polynomial in $m$ although there are $mL$ neurons in total.
Sparse Representer Theorems for Learning in Reproducing Kernel Banach Spaces
http://jmlr.org/papers/v25/23-0645.html
http://jmlr.org/papers/volume25/23-0645/23-0645.pdf
2024Rui Wang, Yuesheng Xu, Mingsong Yan
Sparsity of a learning solution is a desirable feature in machine learning. Certain reproducing kernel Banach spaces (RKBSs) are appropriate hypothesis spaces for sparse learning methods. The goal of this paper is to understand what kind of RKBSs can promote sparsity for learning solutions. We consider two typical learning models in an RKBS: the minimum norm interpolation (MNI) problem and the regularization problem. We first establish an explicit representer theorem for solutions of these problems, which represents the extreme points of the solution set by a linear combination of the extreme points of the subdifferential set of the norm function, which is data-dependent. We then propose sufficient conditions on the RKBS that can transform the explicit representation of the solutions to a sparse kernel representation having fewer terms than the number of the observed data. Under the proposed sufficient conditions, we investigate the role of the regularization parameter on sparsity of the regularized solutions. We further show that two specific RKBSs, the sequence space $\ell_1(\mathbb{N})$ and the measure space, can have sparse representer theorems for both MNI and regularization models.
Exploration of the Search Space of Gaussian Graphical Models for Paired Data
http://jmlr.org/papers/v25/23-0295.html
http://jmlr.org/papers/volume25/23-0295/23-0295.pdf
2024Alberto Roverato, Dung Ngoc Nguyen
We consider the problem of learning a Gaussian graphical model in the case where the observations come from two dependent groups sharing the same variables. We focus on a family of coloured Gaussian graphical models specifically suited for the paired data problem. Commonly, graphical models are ordered by the submodel relationship so that the search space is a lattice, called the model inclusion lattice. We introduce a novel order between models, named the twin order. We show that, embedded with this order, the model space is a lattice that, unlike the model inclusion lattice, is distributive. Furthermore, we provide the relevant rules for the computation of the neighbours of a model. The latter are more efficient than the same operations in the model inclusion lattice, and are then exploited to achieve a more efficient exploration of the search space. These results can be applied to improve the efficiency of both greedy and Bayesian model search procedures. Here, we implement a stepwise backward elimination procedure and evaluate its performance both on synthetic and real-world data.
The good, the bad and the ugly sides of data augmentation: An implicit spectral regularization perspective
http://jmlr.org/papers/v25/22-1312.html
http://jmlr.org/papers/volume25/22-1312/22-1312.pdf
2024Chi-Heng Lin, Chiraag Kaushik, Eva L. Dyer, Vidya Muthukumar
Data augmentation (DA) is a powerful workhorse for bolstering performance in modern machine learning. Specific augmentations like translations and scaling in computer vision are traditionally believed to improve generalization by generating new (artificial) data from the same distribution. However, this traditional viewpoint does not explain the success of prevalent augmentations in modern machine learning (e.g. randomized masking, cutout, mixup), that greatly alter the training data distribution. In this work, we develop a new theoretical framework to characterize the impact of a general class of DA on underparameterized and overparameterized linear model generalization. Our framework reveals that DA induces implicit spectral regularization through a combination of two distinct effects: a) manipulating the relative proportion of eigenvalues of the data covariance matrix in a training-data-dependent manner, and b) uniformly boosting the entire spectrum of the data covariance matrix through ridge regression. These effects, when applied to popular augmentations, give rise to a wide variety of phenomena, including discrepancies in generalization between overparameterized and underparameterized regimes and differences between regression and classification tasks. Our framework highlights the nuanced and sometimes surprising impacts of DA on generalization, and serves as a testbed for novel augmentation design.
Stochastic Approximation with Decision-Dependent Distributions: Asymptotic Normality and Optimality
http://jmlr.org/papers/v25/22-0832.html
http://jmlr.org/papers/volume25/22-0832/22-0832.pdf
2024Joshua Cutler, Mateo Díaz, Dmitriy Drusvyatskiy
We analyze a stochastic approximation algorithm for decision-dependent problems, wherein the data distribution used by the algorithm evolves along the iterate sequence. The primary examples of such problems appear in performative prediction and its multiplayer extensions. We show that under mild assumptions, the deviation between the average iterate of the algorithm and the solution is asymptotically normal, with a covariance that clearly decouples the effects of the gradient noise and the distributional shift. Moreover, building on the work of Hájek and Le Cam, we show that the asymptotic performance of the algorithm with averaging is locally minimax optimal.
Minimax Rates for High-Dimensional Random Tessellation Forests
http://jmlr.org/papers/v25/22-0673.html
http://jmlr.org/papers/volume25/22-0673/22-0673.pdf
2024Eliza O'Reilly, Ngoc Mai Tran
Random forests are a popular class of algorithms used for regression and classification. The algorithm introduced by Breiman in 2001 and many of its variants are ensembles of randomized decision trees built from axis-aligned partitions of the feature space. One such variant, called Mondrian forests, was proposed to handle the online setting and is the first class of random forests for which minimax optimal rates were obtained in arbitrary dimension. However, the restriction to axis-aligned splits fails to capture dependencies between features, and random forests that use oblique splits have shown improved empirical performance for many tasks. This work shows that a large class of random forests with general split directions also achieve minimax optimal rates in arbitrary dimension. This class includes STIT forests, a generalization of Mondrian forests to arbitrary split directions, and random forests derived from Poisson hyperplane tessellations. These are the first results showing that random forest variants with oblique splits can obtain minimax optimality in arbitrary dimension. Our proof technique relies on the novel application of the theory of stationary random tessellations in stochastic geometry to statistical learning theory.
Nonparametric Estimation of Non-Crossing Quantile Regression Process with Deep ReQU Neural Networks
http://jmlr.org/papers/v25/22-0488.html
http://jmlr.org/papers/volume25/22-0488/22-0488.pdf
2024Guohao Shen, Yuling Jiao, Yuanyuan Lin, Joel L. Horowitz, Jian Huang
We propose a penalized nonparametric approach to estimating the quantile regression process (QRP) in a nonseparable model using rectifier quadratic unit (ReQU) activated deep neural networks and introduce a novel penalty function to enforce non-crossing of quantile regression curves. We establish the non-asymptotic excess risk bounds for the estimated QRP and derive the mean integrated squared error for the estimated QRP under mild smoothness and regularity conditions. To establish these non-asymptotic risk and estimation error bounds, we also develop a new error bound for approximating $C^s$ smooth functions with $s >1$ and their derivatives using ReQU activated neural networks. This is a new approximation result for ReQU networks and is of independent interest and may be useful in other problems. Our numerical experiments demonstrate that the proposed method is competitive with or outperforms two existing methods, including methods using reproducing kernels and random forests for nonparametric quantile regression.
Spatial meshing for general Bayesian multivariate models
http://jmlr.org/papers/v25/22-0083.html
http://jmlr.org/papers/volume25/22-0083/22-0083.pdf
2024Michele Peruzzi, David B. Dunson
Quantifying spatial and/or temporal associations in multivariate geolocated data of different types is achievable via spatial random effects in a Bayesian hierarchical model, but severe computational bottlenecks arise when spatial dependence is encoded as a latent Gaussian process (GP) in the increasingly common large scale data settings on which we focus. The scenario worsens in non-Gaussian models because the reduced analytical tractability leads to additional hurdles to computational efficiency. In this article, we introduce Bayesian models of spatially referenced data in which the likelihood or the latent process (or both) are not Gaussian. First, we exploit the advantages of spatial processes built via directed acyclic graphs, in which case the spatial nodes enter the Bayesian hierarchy and lead to posterior sampling via routine Markov chain Monte Carlo (MCMC) methods. Second, motivated by the possible inefficiencies of popular gradient-based sampling approaches in the multivariate contexts on which we focus, we introduce the simplified manifold preconditioner adaptation (SiMPA) algorithm which uses second order information about the target but avoids expensive matrix operations. We demonstrate the performance and efficiency improvements of our methods relative to alternatives in extensive synthetic and real world remote sensing and community ecology applications with large scale data at up to hundreds of thousands of spatial locations and up to tens of outcomes. Software for the proposed methods is part of R package meshed, available on CRAN.
A Semi-parametric Estimation of Personalized Dose-response Function Using Instrumental Variables
http://jmlr.org/papers/v25/21-1181.html
http://jmlr.org/papers/volume25/21-1181/21-1181.pdf
2024Wei Luo, Yeying Zhu, Xuekui Zhang, Lin Lin
In the application of instrumental variable analysis that conducts causal inference in the presence of unmeasured confounding, invalid instrumental variables and weak instrumental variables often exist, which complicates the analysis. In this paper, we propose a model-free dimension reduction procedure to select the invalid instrumental variables and refine them into lower-dimensional linear combinations. The procedure also combines the weak instrumental variables into a few stronger instrumental variables that best condense their information. We then introduce the personalized dose-response function that incorporates the subject's personal characteristics into the conventional dose-response function, and use the reduced data from dimension reduction to propose a novel and easily implementable nonparametric estimator of this function. The proposed approach is suitable for both discrete and continuous treatment variables, and is robust to the dimensionality of data. Its effectiveness is illustrated by simulation studies and a data analysis of the ADNI-DoD study, where the causal relationship between depression and dementia is investigated.
Learning Non-Gaussian Graphical Models via Hessian Scores and Triangular Transport
http://jmlr.org/papers/v25/21-0022.html
http://jmlr.org/papers/volume25/21-0022/21-0022.pdf
2024Ricardo Baptista, Rebecca Morrison, Olivier Zahm, Youssef Marzouk
Undirected probabilistic graphical models represent the conditional dependencies, or Markov properties, of a collection of random variables. Knowing the sparsity of such a graphical model is valuable for modeling multivariate distributions and for efficiently performing inference. While the problem of learning graph structure from data has been studied extensively for certain parametric families of distributions, most existing methods fail to consistently recover the graph structure for non-Gaussian data. Here we propose an algorithm for learning the Markov structure of continuous and non-Gaussian distributions. To characterize conditional independence, we introduce a score based on integrated Hessian information from the joint log-density, and we prove that this score upper bounds the conditional mutual information for a general class of distributions. To compute the score, our algorithm SING estimates the density using a deterministic coupling, induced by a triangular transport map, and iteratively exploits sparse structure in the map to reveal sparsity in the graph. For certain non-Gaussian datasets, we show that our algorithm recovers the graph structure even with a biased approximation to the density. Among other examples, we apply SING to learn the dependencies between the states of a chaotic dynamical system with local interactions.
On the Learnability of Out-of-distribution Detection
http://jmlr.org/papers/v25/23-1257.html
http://jmlr.org/papers/volume25/23-1257/23-1257.pdf
2024Zhen Fang, Yixuan Li, Feng Liu, Bo Han, Jie Lu
Supervised learning aims to train a classifier under the assumption that training and test data are from the same distribution. To relax this assumption, researchers have studied a more realistic setting: out-of-distribution (OOD) detection, where test data may come from classes that are unknown during training (i.e., OOD data). Due to the unavailability and diversity of OOD data, good generalization ability is crucial for effective OOD detection algorithms, and the corresponding learning theory is still an open problem. To study the generalization of OOD detection, this paper investigates the probably approximately correct (PAC) learning theory of OOD detection that fits the commonly used evaluation metrics in the literature. First, we find a necessary condition for the learnability of OOD detection. Then, using this condition, we prove several impossibility theorems for the learnability of OOD detection under some scenarios. Although the impossibility theorems are frustrating, we find that some conditions of these impossibility theorems may not hold in some practical scenarios. Based on this observation, we next give several necessary and sufficient conditions to characterize the learnability of OOD detection in some practical scenarios. Lastly, we offer theoretical support for representative OOD detection works based on our OOD theory.
Win: Weight-Decay-Integrated Nesterov Acceleration for Faster Network Training
http://jmlr.org/papers/v25/23-1073.html
http://jmlr.org/papers/volume25/23-1073/23-1073.pdf
2024Pan Zhou, Xingyu Xie, Zhouchen Lin, Kim-Chuan Toh, Shuicheng Yan
Training deep networks on large-scale datasets is computationally challenging. This work explores the problem of "how to accelerate adaptive gradient algorithms in a general manner", and proposes an effective Weight-decay-Integrated Nesterov acceleration (Win) to accelerate adaptive algorithms. Taking AdamW and Adam as examples, per iteration, we construct a dynamical loss that combines the vanilla training loss and a dynamic regularizer inspired by the proximal point method, and respectively minimize the first- and second-order Taylor approximations of the dynamical loss to update the variable. This yields our Win acceleration that uses a conservative step and an aggressive step to update, and linearly combines these two updates for acceleration. Next, we extend Win into Win2, which uses multiple aggressive update steps for faster convergence. Then we apply Win and Win2 to the popular LAMB and SGD optimizers. Our transparent derivation could provide insights for other accelerated methods and their integration into adaptive algorithms. In addition, we theoretically justify the faster convergence of Win- and Win2-accelerated AdamW, Adam and LAMB relative to their non-accelerated counterparts. Experimental results demonstrate the faster convergence speed and superior performance of our Win- and Win2-accelerated AdamW, Adam, LAMB and SGD over their vanilla counterparts on vision classification and language modeling tasks.
On the Eigenvalue Decay Rates of a Class of Neural-Network Related Kernel Functions Defined on General Domains
http://jmlr.org/papers/v25/23-0866.html
http://jmlr.org/papers/volume25/23-0866/23-0866.pdf
2024Yicheng Li, Zixiong Yu, Guhan Chen, Qian Lin
In this paper, we provide a strategy to determine the eigenvalue decay rate (EDR) of a large class of kernel functions defined on a general domain rather than $\mathbb{S}^{d}$. This class of kernel functions includes, but is not limited to, the neural tangent kernel associated with neural networks with different depths and various activation functions. After proving that the dynamics of training wide neural networks uniformly approximate those of neural tangent kernel regression on general domains, we can further illustrate the minimax optimality of the wide neural network provided that the ground-truth function $f\in [\mathcal H_{\mathrm{NTK}}]^{s}$, an interpolation space associated with the RKHS $\mathcal{H}_{\mathrm{NTK}}$ of NTK. We also show that the overfitted neural network cannot generalize well. We believe our approach for determining the EDR of kernels might also be of independent interest.
Tight Convergence Rate Bounds for Optimization Under Power Law Spectral Conditions
http://jmlr.org/papers/v25/23-0698.html
http://jmlr.org/papers/volume25/23-0698/23-0698.pdf
2024Maksim Velikanov, Dmitry Yarotsky
The performance of optimization on quadratic problems depends sensitively on the low-lying part of the spectrum. For large (effectively infinite-dimensional) problems, this part of the spectrum can often be naturally represented or approximated by power law distributions, resulting in power law convergence rates for iterative solutions of these problems by gradient-based algorithms. In this paper, we propose a new spectral condition providing tighter upper bounds for problems with power law optimization trajectories. We use this condition to build a complete picture of upper and lower bounds for a wide range of optimization algorithms - Gradient Descent, Steepest Descent, Heavy Ball, and Conjugate Gradients - with an emphasis on the underlying schedules of learning rate and momentum. In particular, we demonstrate how an optimally accelerated method, its schedule, and convergence upper bound can be obtained in a unified manner for a given shape of the spectrum. Also, we provide the first proofs of tight lower bounds for convergence rates of Steepest Descent and Conjugate Gradients under spectral power laws with general exponents. Our experiments show that the obtained convergence bounds and acceleration strategies are not only relevant for exactly quadratic optimization problems, but also fairly accurate when applied to the training of neural networks.
ptwt - The PyTorch Wavelet Toolbox
http://jmlr.org/papers/v25/23-0636.html
http://jmlr.org/papers/volume25/23-0636/23-0636.pdf
2024Moritz Wolter, Felix Blanke, Jochen Garcke, Charles Tapley Hoyt
The fast wavelet transform is an essential workhorse in signal processing. Wavelets are local in the spatial- or temporal- and the frequency-domain. This property enables frequency domain analysis while preserving some spatiotemporal information. Until recently, wavelets rarely appeared in the machine learning literature. We provide the PyTorch Wavelet Toolbox to make wavelet methods more accessible to the deep learning community. Our PyTorch Wavelet Toolbox is well documented. A pip package is installable with `pip install ptwt`.
Choosing the Number of Topics in LDA Models – A Monte Carlo Comparison of Selection Criteria
http://jmlr.org/papers/v25/23-0188.html
http://jmlr.org/papers/volume25/23-0188/23-0188.pdf
2024Victor Bystrov, Viktoriia Naboka-Krell, Anna Staszewska-Bystrova, Peter Winker
Selecting the number of topics in Latent Dirichlet Allocation (LDA) models is considered to be a difficult task, for which various approaches have been proposed. In this paper the performance of the recently developed singular Bayesian information criterion (sBIC) is evaluated and compared to the performance of alternative model selection criteria. The sBIC is a generalization of the standard BIC that can be applied to singular statistical models. The comparison is based on Monte Carlo simulations and carried out for several alternative settings, varying with respect to the number of topics, the number of documents and the size of documents in the corpora. Performance is measured using different criteria which take into account the correct number of topics, but also whether the relevant topics from the considered data generation processes (DGPs) are revealed. Practical recommendations for LDA model selection in applications are derived.
Functional Directed Acyclic Graphs
http://jmlr.org/papers/v25/22-1038.html
http://jmlr.org/papers/volume25/22-1038/22-1038.pdf
2024Kuang-Yao Lee, Lexin Li, Bing Li
In this article, we introduce a new method to estimate a directed acyclic graph (DAG) from multivariate functional data. We build on the notion of faithfulness that relates a DAG with a set of conditional independences among the random functions. We develop two linear operators, the conditional covariance operator and the partial correlation operator, to characterize and evaluate the conditional independence. Based on these operators, we adapt and extend the PC-algorithm to estimate the functional directed graph, so that the computation time depends on the sparsity rather than the full size of the graph. We study the asymptotic properties of the two operators, derive their uniform convergence rates, and establish the uniform consistency of the estimated graph, all of which are obtained while allowing the graph size to diverge to infinity with the sample size. We demonstrate the efficacy of our method through both simulations and an application to a time-course proteomic dataset.
Unlabeled Principal Component Analysis and Matrix Completion
http://jmlr.org/papers/v25/22-0816.html
http://jmlr.org/papers/volume25/22-0816/22-0816.pdf
2024Yunzhen Yao, Liangzu Peng, Manolis C. Tsakiris
We introduce robust principal component analysis from a data matrix in which the entries of its columns have been corrupted by permutations, termed Unlabeled Principal Component Analysis (UPCA). Using algebraic geometry, we establish that UPCA is a well-defined algebraic problem since we prove that the only matrices of minimal rank that agree with the given data are row-permutations of the ground-truth matrix, arising as the unique solutions of a polynomial system of equations. Further, we propose an efficient two-stage algorithmic pipeline for UPCA suitable for the practically relevant case where only a fraction of the data have been permuted. Stage-I employs outlier-robust PCA methods to estimate the ground-truth column-space. Equipped with the column-space, Stage-II applies recent methods for unlabeled sensing to restore the permuted data. Allowing for missing entries on top of permutations in UPCA leads to the problem of unlabeled matrix completion, for which we derive theory and algorithms of similar flavor. Experiments on synthetic data, face images, educational and medical records reveal the potential of our algorithms for applications such as data privatization and record linkage.
Distributed Estimation on Semi-Supervised Generalized Linear Model
http://jmlr.org/papers/v25/22-0670.html
http://jmlr.org/papers/volume25/22-0670/22-0670.pdf
2024Jiyuan Tu, Weidong Liu, Xiaojun Mao
Semi-supervised learning is devoted to using unlabeled data to improve the performance of machine learning algorithms. In this paper, we study the semi-supervised generalized linear model (GLM) in the distributed setup. In the cases of single or multiple machines containing unlabeled data, we propose two distributed semi-supervised algorithms based on the distributed approximate Newton method. When the labeled local sample size is small, our algorithms still give a consistent estimation, while fully supervised methods fail to converge. Moreover, we theoretically prove that the convergence rate is greatly improved when sufficient unlabeled data exists. Therefore, the proposed method requires far fewer rounds of communication to achieve the optimal rate than its fully-supervised counterpart. In the case of the linear model, we prove the rate lower bound after one round of communication, which shows that rate improvement is essential. Finally, several simulation analyses and real data studies are provided to demonstrate the effectiveness of our method.
Towards Explainable Evaluation Metrics for Machine Translation
http://jmlr.org/papers/v25/22-0416.html
http://jmlr.org/papers/volume25/22-0416/22-0416.pdf
2024Christoph Leiter, Piyawat Lertvittayakumjorn, Marina Fomicheva, Wei Zhao, Yang Gao, Steffen Eger
Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics for machine translation (for example, COMET or BERTScore) are based on black-box large language models. They often achieve strong correlations with human judgments, but recent research indicates that the lower-quality classical metrics remain dominant, one of the potential reasons being that their decision processes are more transparent. To foster more widespread acceptance of novel high-quality metrics, explainability thus becomes crucial. In this concept paper, we identify key properties as well as key goals of explainable machine translation metrics and provide a comprehensive synthesis of recent techniques, relating them to our established goals and properties. In this context, we also discuss the latest state-of-the-art approaches to explainable metrics based on generative models such as ChatGPT and GPT4. Finally, we contribute a vision of next-generation approaches, including natural language explanations. We hope that our work can help catalyze and guide future research on explainable evaluation metrics and, mediately, also contribute to better and more transparent machine translation systems.
Differentially private methods for managing model uncertainty in linear regression
http://jmlr.org/papers/v25/21-1536.html
http://jmlr.org/papers/volume25/21-1536/21-1536.pdf
2024Víctor Peña, Andrés F. Barrientos
In this article, we propose differentially private methods for hypothesis testing, model averaging, and model selection for normal linear models. We propose Bayesian methods based on mixtures of $g$-priors and non-Bayesian methods based on likelihood-ratio statistics and information criteria. The procedures are asymptotically consistent and straightforward to implement with existing software. We focus on practical issues such as adjusting critical values so that hypothesis tests have adequate type I error rates and quantifying the uncertainty introduced by the privacy-ensuring mechanisms.
Data Summarization via Bilevel Optimization
http://jmlr.org/papers/v25/21-1132.html
http://jmlr.org/papers/volume25/21-1132/21-1132.pdf
2024Zalán Borsos, Mojmír Mutný, Marco Tagliasacchi, Andreas Krause
The increasing availability of massive data sets poses various challenges for machine learning. Prominent among these is learning models under hardware or human resource constraints. In such resource-constrained settings, a simple yet powerful approach is operating on small subsets of the data. Coresets are weighted subsets of the data that provide approximation guarantees for the optimization objective. However, existing coreset constructions are highly model-specific and are limited to simple models such as linear regression, logistic regression, and k-means. In this work, we propose a generic coreset construction framework that formulates the coreset selection as a cardinality-constrained bilevel optimization problem. In contrast to existing approaches, our framework does not require model-specific adaptations and applies to any twice differentiable model, including neural networks. We show the effectiveness of our framework for a wide range of models in various settings, including training non-convex models online and batch active learning.
Pareto Smoothed Importance Sampling
http://jmlr.org/papers/v25/19-556.html
http://jmlr.org/papers/volume25/19-556/19-556.pdf
2024Aki Vehtari, Daniel Simpson, Andrew Gelman, Yuling Yao, Jonah Gabry
Importance weighting is a general way to adjust Monte Carlo integration to account for draws from the wrong distribution, but the resulting estimate can be highly variable when the importance ratios have a heavy right tail. This routinely occurs when there are aspects of the target distribution that are not well captured by the approximating distribution, in which case more stable estimates can be obtained by modifying extreme importance ratios. We present a new method for stabilizing importance weights using a generalized Pareto distribution fit to the upper tail of the distribution of the simulated importance ratios. The method, which empirically performs better than existing methods for stabilizing importance sampling estimates, includes stabilized effective sample size estimates, Monte Carlo error estimates, and convergence diagnostics. The presented Pareto $\hat{k}$ finite sample convergence rate diagnostic is useful for any Monte Carlo estimator.
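As background for the abstract above, the following is a minimal sketch of plain self-normalized importance sampling together with the usual effective sample size estimate. It does not implement the Pareto smoothing step of the paper. The target N(0, 1), the proposal N(0, 2^2), the integrand x^2, and all names are assumptions chosen for illustration.

```python
import math
import random


def log_normal_pdf(x, mean, sd):
    """Log density of a normal distribution with the given mean and sd."""
    return (-0.5 * ((x - mean) / sd) ** 2
            - math.log(sd) - 0.5 * math.log(2.0 * math.pi))


rng = random.Random(1)
n = 20000

# Draw from the approximating (proposal) distribution N(0, 2^2).
draws = [rng.gauss(0.0, 2.0) for _ in range(n)]

# Log importance ratios: target N(0, 1) over proposal N(0, 2^2).
log_w = [log_normal_pdf(x, 0.0, 1.0) - log_normal_pdf(x, 0.0, 2.0) for x in draws]

# Subtract the maximum before exponentiating for numerical stability.
m = max(log_w)
w = [math.exp(lw - m) for lw in log_w]
total = sum(w)
norm_w = [wi / total for wi in w]

# Self-normalized estimate of E[x^2] under the target (true value is 1).
estimate = sum(wi * x * x for wi, x in zip(norm_w, draws))

# Effective sample size: 1 / sum of squared normalized weights.
ess = 1.0 / sum(wi * wi for wi in norm_w)
```

Here the proposal is wider than the target, so the ratios are bounded and the estimate is stable. When the proposal is narrower than the target, the ratios develop the heavy right tail that the abstract describes.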
Policy Gradient Methods in the Presence of Symmetries and State Abstractions
http://jmlr.org/papers/v25/23-1415.html
http://jmlr.org/papers/volume25/23-1415/23-1415.pdf
2024Prakash Panangaden, Sahand Rezaei-Shoshtari, Rosie Zhao, David Meger, Doina Precup
Reinforcement learning (RL) on high-dimensional and complex problems relies on abstraction for improved efficiency and generalization. In this paper, we study abstraction in the continuous-control setting, and extend the definition of Markov decision process (MDP) homomorphisms to the setting of continuous state and action spaces. We derive a policy gradient theorem on the abstract MDP for both stochastic and deterministic policies. Our policy gradient results allow for leveraging approximate symmetries of the environment for policy optimization. Based on these theorems, we propose a family of actor-critic algorithms that are able to learn the policy and the MDP homomorphism map simultaneously, using the lax bisimulation metric. Finally, we introduce a series of environments with continuous symmetries to further demonstrate the ability of our algorithm for action abstraction in the presence of such symmetries. We demonstrate the effectiveness of our method on our environments, as well as on challenging visual control tasks from the DeepMind Control Suite. Our method's ability to utilize MDP homomorphisms for representation learning leads to improved performance, and the visualizations of the latent space clearly demonstrate the structure of the learned abstraction.
Scaling Instruction-Finetuned Language Models
http://jmlr.org/papers/v25/23-0870.html
http://jmlr.org/papers/volume25/23-0870/23-0870.pdf
2024Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation, RealToxicityPrompts). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks (at time of release), such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
Tangential Wasserstein Projections
http://jmlr.org/papers/v25/23-0708.html
http://jmlr.org/papers/volume25/23-0708/23-0708.pdf
2024Florian Gunsilius, Meng Hsuan Hsieh, Myung Jin Lee
We develop a notion of projections between sets of probability measures using the geometric properties of the $2$-Wasserstein space. In contrast to existing methods, it is designed for multivariate probability measures that need not be regular, and is computationally efficient to implement via regression. The idea is to work on tangent cones of the Wasserstein space using generalized geodesics. Its structure and computational properties make the method applicable in a variety of settings where probability measures need not be regular, from causal inference to the analysis of object data. An application to estimating causal effects yields a generalization of the synthetic controls method for systems with general heterogeneity described via multivariate probability measures.
Learnability of Linear Port-Hamiltonian Systems
http://jmlr.org/papers/v25/23-0450.html
http://jmlr.org/papers/volume25/23-0450/23-0450.pdf
2024Juan-Pablo Ortega, Daiying Yin
A complete structure-preserving learning scheme for single-input/single-output (SISO) linear port-Hamiltonian systems is proposed. The construction is based on the solution, when possible, of the unique identification problem for these systems, in ways that reveal fundamental relationships between classical notions in control theory and crucial properties in the machine learning context, like structure-preservation and expressive power. In the canonical case, it is shown that, up to initializations, the set of uniquely identified systems can be explicitly characterized as a smooth manifold endowed with global Euclidean coordinates, which allows concluding that the parameter complexity necessary for the replication of the dynamics is only $\mathcal{O}(n)$ and not $\mathcal{O}(n^2)$, as suggested by the standard parametrization of these systems. Furthermore, it is shown that linear port-Hamiltonian systems can be learned while remaining agnostic about the dimension of the underlying data-generating system. Numerical experiments show that this methodology can be used to efficiently estimate linear port-Hamiltonian systems out of input-output realizations, making the contributions in this paper the first example of a structure-preserving machine learning paradigm for linear port-Hamiltonian systems based on explicit representations of this model category.
Off-Policy Action Anticipation in Multi-Agent Reinforcement Learning
http://jmlr.org/papers/v25/23-0413.html
http://jmlr.org/papers/volume25/23-0413/23-0413.pdf
2024Ariyan Bighashdel, Daan de Geus, Pavol Jancura, Gijs Dubbelman
Learning anticipation in Multi-Agent Reinforcement Learning (MARL) is a reasoning paradigm where agents anticipate the learning steps of other agents to improve cooperation among themselves. As MARL uses gradient-based optimization, learning anticipation requires using Higher-Order Gradients (HOG), with so-called HOG methods. Existing HOG methods are based on policy parameter anticipation, i.e., agents anticipate the changes in policy parameters of other agents. Currently, however, these existing HOG methods have only been developed for differentiable games or games with small state spaces. In this work, we demonstrate that in the case of non-differentiable games with large state spaces, existing HOG methods do not perform well and are inefficient due to their inherent limitations related to policy parameter anticipation and multiple sampling stages. To overcome these problems, we propose Off-Policy Action Anticipation (OffPA2), a novel framework that approaches learning anticipation through action anticipation, i.e., agents anticipate the changes in actions of other agents, via off-policy sampling. We theoretically analyze our proposed OffPA2 and employ it to develop multiple HOG methods that are applicable to non-differentiable games with large state spaces. We conduct a large set of experiments and illustrate that our proposed HOG methods outperform the existing ones regarding efficiency and performance.
On Unbiased Estimation for Partially Observed Diffusions
http://jmlr.org/papers/v25/23-0347.html
http://jmlr.org/papers/volume25/23-0347/23-0347.pdf
2024Jeremy Heng, Jeremie Houssineau, Ajay Jasra
We consider a class of diffusion processes with finite-dimensional parameters and partially observed at discrete time instances. We propose a methodology to unbiasedly estimate the expectation of a given functional of the diffusion process conditional on parameters and data. When these unbiased estimators with appropriately chosen functionals are employed within an expectation-maximization algorithm or a stochastic gradient method, this enables statistical inference using the maximum likelihood or Bayesian framework. Compared to existing approaches, the use of our unbiased estimators allows one to remove any time-discretization bias and Markov chain Monte Carlo burn-in bias. Central to our methodology is a novel and natural combination of multilevel randomization schemes and unbiased Markov chain Monte Carlo methods, and the development of new couplings of multiple conditional particle filters. We establish under assumptions that our estimators are unbiased and have finite variance. We illustrate various aspects of our method on an Ornstein--Uhlenbeck model, a logistic diffusion model for population dynamics, and a neural network model for grid cells.
Improving Lipschitz-Constrained Neural Networks by Learning Activation Functions
http://jmlr.org/papers/v25/22-1347.html
http://jmlr.org/papers/volume25/22-1347/22-1347.pdf
2024Stanislas Ducotterd, Alexis Goujon, Pakshal Bohra, Dimitris Perdios, Sebastian Neumayer, Michael Unser
Lipschitz-constrained neural networks have several advantages over unconstrained ones and can be applied to a variety of problems, making them a topic of attention in the deep learning community. Unfortunately, it has been shown both theoretically and empirically that they perform poorly when equipped with ReLU activation functions. By contrast, neural networks with learnable 1-Lipschitz linear splines are known to be more expressive. In this paper, we show that such networks correspond to global optima of a constrained functional optimization problem that consists of the training of a neural network composed of 1-Lipschitz linear layers and 1-Lipschitz freeform activation functions with second-order total-variation regularization. Further, we propose an efficient method to train these neural networks. Our numerical experiments show that our trained networks compare favorably with existing 1-Lipschitz neural architectures.
Mathematical Framework for Online Social Media Auditing
http://jmlr.org/papers/v25/22-1112.html
http://jmlr.org/papers/volume25/22-1112/22-1112.pdf
2024Wasim Huleihel, Yehonathan Refael
Social media platforms (SMPs) leverage algorithmic filtering (AF) as a means of selecting the content that constitutes a user's feed with the aim of maximizing their rewards. Selectively choosing the contents to be shown on the user's feed may yield a certain extent of influence, either minor or major, on the user's decision-making, compared to what it would have been under a natural/fair content selection. As we have witnessed over the past decade, algorithmic filtering can cause detrimental side effects, ranging from biasing individual decisions to shaping those of society as a whole, for example, diverting users' attention from whether to get the COVID-19 vaccine or inducing the public to choose a presidential candidate. The government's constant attempts to regulate the adverse effects of AF are often complicated, due to bureaucracy, legal affairs, and financial considerations. On the other hand, SMPs seek to monitor their own algorithmic activities to avoid being fined for exceeding the allowable threshold. In this paper, we mathematically formalize this framework and utilize it to construct a data-driven statistical auditing procedure to regulate AF from deflecting users' beliefs over time, along with sample complexity guarantees. This state-of-the-art algorithm can be used either by authorities acting as external regulators or by SMPs for self-auditing.
An Embedding Framework for the Design and Analysis of Consistent Polyhedral Surrogates
http://jmlr.org/papers/v25/22-0743.html
http://jmlr.org/papers/volume25/22-0743/22-0743.pdf
2024Jessie Finocchiaro, Rafael M. Frongillo, Bo Waggoner
We formalize and study the natural approach of designing convex surrogate loss functions via embeddings, for discrete problems such as classification, ranking, or structured prediction. In this approach, one embeds each of the finitely many predictions (e.g. rankings) as a point in $\mathbb{R}^d$, assigns the original loss values to these points, and “convexifies” the loss in some way to obtain a surrogate. We establish a strong connection between this approach and polyhedral (piecewise-linear convex) surrogate losses: every discrete loss is embedded by some polyhedral loss, and every polyhedral loss embeds some discrete loss. Moreover, an embedding gives rise to a consistent link function as well as linear surrogate regret bounds. Our results are constructive, as we illustrate with several examples. In particular, our framework gives succinct proofs of consistency or inconsistency for existing polyhedral surrogates, and for inconsistent surrogates, it further reveals the discrete losses for which these surrogates are consistent. We go on to show additional structure of embeddings, such as the equivalence of embedding and matching Bayes risks, and the equivalence of various notions of non-redundancy. Using these results, we establish that indirect elicitation, a necessary condition for consistency, is also sufficient when working with polyhedral surrogates.
Low-rank Variational Bayes correction to the Laplace method
http://jmlr.org/papers/v25/21-1405.html
http://jmlr.org/papers/volume25/21-1405/21-1405.pdf
2024Janet van Niekerk, Haavard Rue
Approximate inference methods like the Laplace method, Laplace approximations and variational methods, amongst others, are popular methods when exact inference is not feasible due to the complexity of the model or the abundance of data. In this paper we propose a hybrid approximate method called Low-Rank Variational Bayes correction (VBC), that uses the Laplace method and subsequently a Variational Bayes correction in a lower dimension, to the joint posterior mean. The cost is essentially that of the Laplace method which ensures scalability of the method, in both model complexity and data size. Models with fixed and unknown hyperparameters are considered, for simulated and real examples, for small and large data sets.
Scaling the Convex Barrier with Sparse Dual Algorithms
http://jmlr.org/papers/v25/21-0076.html
http://jmlr.org/papers/volume25/21-0076/21-0076.pdf
2024Alessandro De Palma, Harkirat Singh Behl, Rudy Bunel, Philip H.S. Torr, M. Pawan Kumar
Tight and efficient neural network bounding is crucial to the scaling of neural network verification systems. Many efficient bounding algorithms have been presented recently, but they are often too loose to verify more challenging properties. This is due to the weakness of the employed relaxation, which is usually a linear program of size linear in the number of neurons. While a tighter linear relaxation for piecewise-linear activations exists, it comes at the cost of exponentially many constraints and currently lacks an efficient customized solver. We alleviate this deficiency by presenting two novel dual algorithms: one operates a subgradient method on a small active set of dual variables, the other exploits the sparsity of Frank-Wolfe type optimizers to incur only a linear memory cost. Both methods recover the strengths of the new relaxation: tightness and a linear separation oracle. At the same time, they share the benefits of previous dual approaches for weaker relaxations: massive parallelism, GPU implementation, low cost per iteration and valid bounds at any time. As a consequence, we can obtain better bounds than off-the-shelf solvers in only a fraction of their running time, attaining significant formal verification speed-ups.
Causal-learn: Causal Discovery in Python
http://jmlr.org/papers/v25/23-0970.html
http://jmlr.org/papers/volume25/23-0970/23-0970.pdf
2024Yujia Zheng, Biwei Huang, Wei Chen, Joseph Ramsey, Mingming Gong, Ruichu Cai, Shohei Shimizu, Peter Spirtes, Kun Zhang
Causal discovery aims at revealing causal relations from observational data, which is a fundamental task in science and engineering. We describe causal-learn, an open-source Python library for causal discovery. This library focuses on bringing a comprehensive collection of causal discovery methods to both practitioners and researchers. It provides easy-to-use APIs for non-specialists, modular building blocks for developers, detailed documentation for learners, and comprehensive methods for all. Different from previous packages in R or Java, causal-learn is fully developed in Python, which could be more in tune with the recent preference shift in programming languages within related communities. The library is available at https://github.com/py-why/causal-learn.
Decomposed Linear Dynamical Systems (dLDS) for learning the latent components of neural dynamics
http://jmlr.org/papers/v25/23-0777.html
http://jmlr.org/papers/volume25/23-0777/23-0777.pdf
2024Noga Mudrik, Yenho Chen, Eva Yezerets, Christopher J. Rozell, Adam S. Charles
Learning interpretable representations of neural dynamics at a population level is a crucial first step to understanding how observed neural activity relates to perception and behavior. Models of neural dynamics often focus on either low-dimensional projections of neural activity or on learning dynamical systems that explicitly relate to the neural state over time. We discuss how these two approaches are interrelated by considering dynamical systems as representative of flows on a low-dimensional manifold. Building on this concept, we propose a new decomposed dynamical system model that represents complex non-stationary and nonlinear dynamics of time series data as a sparse combination of simpler, more interpretable components. Our model is trained through a dictionary learning procedure, where we leverage recent results in tracking sparse vectors over time. The decomposed nature of the dynamics is more expressive than previous switched approaches for a given number of parameters and enables modeling of overlapping and non-stationary dynamics. In both continuous-time and discrete-time instructional examples, we demonstrate that our model effectively approximates the original system, learns efficient representations, and captures smooth transitions between dynamical modes. Furthermore, we highlight our model’s ability to efficiently capture and demix population dynamics generated from multiple independent subnetworks, a task that is computationally impractical for switched models. Finally, we apply our model to neural “full brain” recordings of C. elegans data, illustrating a diversity of dynamics that is obscured when classified into discrete states.
Existence and Minimax Theorems for Adversarial Surrogate Risks in Binary Classification
http://jmlr.org/papers/v25/23-0456.html
http://jmlr.org/papers/volume25/23-0456/23-0456.pdf
2024Natalie S. Frank, Jonathan Niles-Weed
We prove existence, minimax, and complementary slackness theorems for adversarial surrogate risks in binary classification. These results extend recent work that established analogous minimax and existence theorems for the adversarial classification risk. We show that such statements continue to hold for a very general class of surrogate losses; moreover, we remove some of the technical restrictions present in prior work. Our results provide an explanation for the phenomenon of transfer attacks and inform new directions in algorithm development.
Data Thinning for Convolution-Closed Distributions
http://jmlr.org/papers/v25/23-0446.html
http://jmlr.org/papers/volume25/23-0446/23-0446.pdf
2024Anna Neufeld, Ameer Dharamshi, Lucy L. Gao, Daniela Witten
We propose data thinning, an approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This very general proposal is applicable to any convolution-closed distribution, a class that includes the Gaussian, Poisson, negative binomial, gamma, and binomial distributions, among others. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the usual approach of cross-validation via sample splitting, especially in settings in which the latter is not applicable. In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis, for which traditional sample splitting is unattractive or unavailable.
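The Poisson case of this idea can be sketched in a few lines. The sketch relies on the standard fact that if X ~ Poisson(λ) and X1 given X is Binomial(X, ε), then X1 ~ Poisson(ελ) and X2 = X − X1 ~ Poisson((1 − ε)λ) are independent. This is an illustration under stated assumptions, not the authors' code; the helper names, the seed, and the values of `lam` and `eps` are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) by Knuth's multiplication method (small lam only)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def thin_poisson(x, eps, rng):
    """Split a count x into (x1, x2) with x1 ~ Binomial(x, eps) and x2 = x - x1."""
    x1 = sum(1 for _ in range(x) if rng.random() < eps)
    return x1, x - x1


rng = random.Random(0)
lam, eps, n = 5.0, 0.3, 20000

draws = [sample_poisson(lam, rng) for _ in range(n)]
parts = [thin_poisson(x, eps, rng) for x in draws]

# The two parts should have means near eps * lam and (1 - eps) * lam.
mean_x1 = sum(p[0] for p in parts) / n
mean_x2 = sum(p[1] for p in parts) / n

# Independence implies a sample covariance near zero.
cov = sum((p[0] - mean_x1) * (p[1] - mean_x2) for p in parts) / n
```

In a cross-validation setting, one part would play the role of the training data and the other the held-out data, with the known scaling ε relating their parameters.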
A projected semismooth Newton method for a class of nonconvex composite programs with strong prox-regularity
http://jmlr.org/papers/v25/23-0371.html
http://jmlr.org/papers/volume25/23-0371/23-0371.pdf
2024Jiang Hu, Kangkang Deng, Jiayuan Wu, Quanzheng Li
This paper aims to develop a Newton-type method to solve a class of nonconvex composite programs. In particular, the nonsmooth part is possibly nonconvex. To tackle the nonconvexity, we develop a notion of strong prox-regularity which is related to the singleton property and Lipschitz continuity of the associated proximal operator, and we verify it in various classes of functions, including weakly convex functions, indicator functions of proximally smooth sets, and two specific sphere-related nonconvex nonsmooth functions. In this case, the problem class we are concerned with covers smooth optimization problems on manifold and certain composite optimization problems on manifold. For the latter, the proposed algorithm is the first second-order type method. Combining with the semismoothness of the proximal operator, we design a projected semismooth Newton method to find a root of the natural residual induced by the proximal gradient method. Due to the possible nonconvexity of the feasible domain, an extra projection is added to the usual semismooth Newton step and new criteria are proposed for the switching between the projected semismooth Newton step and the proximal step. The global convergence is then established under the strong prox-regularity. Based on the BD regularity condition, we establish local superlinear convergence. Numerical experiments demonstrate the effectiveness of our proposed method compared with state-of-the-art ones.
Revisiting RIP Guarantees for Sketching Operators on Mixture Models
http://jmlr.org/papers/v25/23-0044.html
http://jmlr.org/papers/volume25/23-0044/23-0044.pdf
2024Ayoub Belhadji, Rémi Gribonval
In the context of sketching for compressive mixture modeling, we revisit existing proofs of the Restricted Isometry Property of sketching operators with respect to certain mixture models. After examining the shortcomings of existing guarantees, we propose an alternative analysis that circumvents the need to assume importance sampling when drawing random Fourier features to build random sketching operators. Our analysis is based on new deterministic bounds on the restricted isometry constant that depend solely on the set of frequencies used to define the sketching operator; then we leverage these bounds to establish concentration inequalities for random sketching operators that lead to the desired RIP guarantees. Our analysis also opens the door to theoretical guarantees for structured sketching with frequencies associated to fast random linear operators.
Monotonic Risk Relationships under Distribution Shifts for Regularized Risk Minimization
http://jmlr.org/papers/v25/22-1197.html
http://jmlr.org/papers/volume25/22-1197/22-1197.pdf
2024Daniel LeJeune, Jiayu Liu, Reinhard Heckel
Machine learning systems are often applied to data that is drawn from a different distribution than the training distribution. Recent work has shown that for a variety of classification and signal reconstruction problems, the out-of-distribution performance is strongly linearly correlated with the in-distribution performance. If this relationship or more generally a monotonic one holds, it has important consequences. For example, it allows one to optimize performance on one distribution as a proxy for performance on the other. In this paper, we study conditions under which a monotonic relationship between the performances of a model on two distributions is expected. We prove an exact asymptotic linear relation for squared error and a monotonic relation for misclassification error for ridge-regularized general linear models under covariate shift, as well as an approximate linear relation for linear inverse problems.
Polygonal Unadjusted Langevin Algorithms: Creating stable and efficient adaptive algorithms for neural networks
http://jmlr.org/papers/v25/22-0796.html
http://jmlr.org/papers/volume25/22-0796/22-0796.pdf
2024Dong-Young Lim, Sotirios Sabanis
We present a new class of Langevin-based algorithms, which overcomes many of the known shortcomings of popular adaptive optimizers that are currently used for the fine tuning of deep learning models. Its underpinning theory relies on recent advances of Euler-Krylov polygonal approximations for stochastic differential equations (SDEs) with monotone coefficients. As a result, it inherits the stability properties of tamed algorithms, while it addresses other known issues, e.g. vanishing gradients in deep learning. In particular, we provide a nonasymptotic analysis and full theoretical guarantees for the convergence properties of an algorithm of this novel class, which we named TH$\varepsilon$O POULA (or, simply, TheoPouLa). Finally, several experiments are presented with different types of deep learning models, which show the superior performance of TheoPouLa over many popular adaptive optimization algorithms.
Axiomatic effect propagation in structural causal models
http://jmlr.org/papers/v25/22-0285.html
http://jmlr.org/papers/volume25/22-0285/22-0285.pdf
2024Raghav Singal, George Michailidis
We study effect propagation in a causal directed acyclic graph (DAG), with the goal of providing a flow-based decomposition of the effect (i.e., change in the outcome variable) as a result of changes in the source variables. We first compare various ideas on causality to quantify effect propagation, such as direct and indirect effects, path-specific effects, and degree of responsibility. We discuss the shortcomings of such approaches and propose a flow-based methodology, which we call recursive Shapley value (RSV). By considering a broader set of counterfactuals than existing methods, RSV obeys a unique adherence to four desirable flow-based axioms. Further, we provide a general path-based characterization of RSV for an arbitrary non-parametric structural equations model (SEM) defined on the underlying DAG. Interestingly, for the special class of linear SEMs, RSV exhibits a simple and tractable characterization (and hence, computation), which recovers the classical method of path coefficients and is equivalent to path-specific effects. For non-parametric SEMs, we use our general characterization to develop an unbiased Monte-Carlo estimation procedure with an exponentially decaying sample complexity. We showcase the application of RSV on two challenging problems on causality (causal overdetermination and causal unfairness).
Optimal First-Order Algorithms as a Function of Inequalities
http://jmlr.org/papers/v25/21-1256.html
http://jmlr.org/papers/volume25/21-1256/21-1256.pdf
2024Chanwoo Park, Ernest K. Ryu
In this work, we present a novel algorithm design methodology that finds the optimal algorithm as a function of inequalities. Specifically, we restrict convergence analyses of algorithms to use a prespecified subset of inequalities, rather than utilizing all true inequalities, and find the optimal algorithm subject to this restriction. This methodology allows us to design algorithms with certain desired characteristics. As concrete demonstrations of this methodology, we find new state-of-the-art accelerated first-order gradient methods using randomized coordinate updates and backtracking line searches.
Resource-Efficient Neural Networks for Embedded Systems
http://jmlr.org/papers/v25/18-566.html
http://jmlr.org/papers/volume25/18-566/18-566.pdf
2024Wolfgang Roth, Günther Schindler, Bernhard Klein, Robert Peharz, Sebastian Tschiatschek, Holger Fröning, Franz Pernkopf, Zoubin Ghahramani
While machine learning is traditionally a resource-intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensuring a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on resource-efficient inference based on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark data sets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and prediction quality.
Trained Transformers Learn Linear Models In-Context
http://jmlr.org/papers/v25/23-1042.html
http://jmlr.org/papers/volume25/23-1042/23-1042.pdf
2024Ruiqi Zhang, Spencer Frei, Peter L. Bartlett
Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL): Given a short prompt sequence of tokens from an unseen task, they can formulate relevant per-token and next-token predictions without any parameter updates. By embedding a sequence of labeled training data and unlabeled test data as a prompt, this allows for transformers to behave like supervised learning algorithms. Indeed, recent work has shown that when training transformer architectures over random instances of linear regression problems, these models' predictions mimic those of ordinary least squares. Towards understanding the mechanisms underlying this phenomenon, we investigate the dynamics of ICL in transformers with a single linear self-attention layer trained by gradient flow on linear regression tasks. We show that despite non-convexity, gradient flow with a suitable random initialization finds a global minimum of the objective function. At this global minimum, when given a test prompt of labeled examples from a new prediction task, the transformer achieves prediction error competitive with the best linear predictor over the test prompt distribution. We additionally characterize the robustness of the trained transformer to a variety of distribution shifts and show that although a number of shifts are tolerated, shifts in the covariate distribution of the prompts are not. Motivated by this, we consider a generalized ICL setting where the covariate distributions can vary across prompts. We show that although gradient flow succeeds at finding a global minimum in this setting, the trained transformer is still brittle under mild covariate shifts. We complement this finding with experiments on large, nonlinear transformer architectures which we show are more robust under covariate shifts.
Adam-family Methods for Nonsmooth Optimization with Convergence Guarantees
http://jmlr.org/papers/v25/23-0576.html
http://jmlr.org/papers/volume25/23-0576/23-0576.pdf
2024Nachuan Xiao, Xiaoyin Hu, Xin Liu, Kim-Chuan Toh
In this paper, we present a comprehensive study on the convergence properties of Adam-family methods for nonsmooth optimization, especially in the training of nonsmooth neural networks. We introduce a novel two-timescale framework that adopts a two-timescale updating scheme, and prove its convergence properties under mild assumptions. Our proposed framework encompasses various popular Adam-family methods, providing convergence guarantees for these methods in training nonsmooth neural networks. Furthermore, we develop stochastic subgradient methods that incorporate gradient clipping techniques for training nonsmooth neural networks with heavy-tailed noise. Through our framework, we show that our proposed methods converge even when the evaluation noises are only assumed to be integrable. Extensive numerical experiments demonstrate the high efficiency and robustness of our proposed methods.
Efficient Modality Selection in Multimodal Learning
http://jmlr.org/papers/v25/23-0439.html
http://jmlr.org/papers/volume25/23-0439/23-0439.pdf
2024Yifei He, Runxiang Cheng, Gargi Balasubramaniam, Yao-Hung Hubert Tsai, Han Zhao
Multimodal learning aims to learn from data of different modalities by fusing information from heterogeneous sources. Although it is beneficial to learn from more modalities, it is often infeasible to use all available modalities under limited computational resources. Modeling with all available modalities can also be inefficient and unnecessary when information across input modalities overlaps. In this paper, we study the modality selection problem, which aims to select the most useful subset of modalities for learning under a cardinality constraint. To that end, we propose a unified theoretical framework to quantify the learning utility of modalities, and we identify dependence assumptions to flexibly model the heterogeneous nature of multimodal data, which also allows efficient algorithm design. Accordingly, we derive a greedy modality selection algorithm via submodular maximization, which selects the most useful modalities with an optimality guarantee on learning performance. We also connect marginal-contribution-based feature importance scores, such as Shapley value, from the feature selection domain to the context of modality selection, to efficiently compute the importance of individual modalities. We demonstrate the efficacy of our theoretical results and modality selection algorithms on 2 synthetic and 4 real-world data sets on a diverse range of multimodal data.
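The greedy selection under a cardinality constraint mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' algorithm or utility: the function `greedy_select`, the modality names, and the toy coverage utility are all hypothetical, chosen only because coverage functions are a standard example of a monotone submodular objective.

```python
def greedy_select(candidates, utility, k):
    """Greedily pick up to k items, each time adding the item with the
    largest marginal gain in `utility` (a set function on lists)."""
    selected = []
    remaining = set(candidates)
    for _ in range(min(k, len(remaining))):
        base = utility(selected)
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining),
                   key=lambda m: utility(selected + [m]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: each modality "covers" a set of information units.
COVERAGE = {
    "audio": {1, 2},
    "video": {2, 3, 4},
    "text": {4, 5},
}


def coverage_utility(modalities):
    """Number of distinct units covered; monotone and submodular."""
    covered = set()
    for m in modalities:
        covered |= COVERAGE[m]
    return len(covered)


chosen = greedy_select(COVERAGE.keys(), coverage_utility, k=2)
```

For monotone submodular utilities, this kind of greedy rule is the classical route to a constant-factor approximation guarantee; the specific guarantee in the paper depends on its own utility definition and assumptions.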
A Multilabel Classification Framework for Approximate Nearest Neighbor Search
http://jmlr.org/papers/v25/23-0286.html
http://jmlr.org/papers/volume25/23-0286/23-0286.pdf
2024Ville Hyvönen, Elias Jääsaari, Teemu Roos
To learn partition-based index structures for approximate nearest neighbor (ANN) search, both supervised and unsupervised machine learning algorithms have been used. Existing supervised algorithms select all the points that belong to the same partition element as the query point as nearest neighbor candidates. Consequently, they formulate the learning task as finding a partition in which the nearest neighbors of a query point belong to the same partition element as the query point as often as possible. In contrast, we formulate the candidate set selection in ANN search directly as a multilabel classification problem where the labels correspond to the nearest neighbors of the query point. In the proposed framework, partition-based index structures are interpreted as partitioning classifiers for solving this classification problem. Empirical results suggest that, when combined with any partitioning strategy, the natural classifier based on the proposed framework leads to strictly improved performance compared to the earlier candidate set selection methods. We also prove a sufficient condition for the consistency of a partitioning classifier for ANN search, and illustrate the result by verifying this condition for chronological $k$-d trees and (both dense and sparse) random projection trees.
Probabilistic Forecasting with Generative Networks via Scoring Rule Minimization
http://jmlr.org/papers/v25/23-0038.html
http://jmlr.org/papers/volume25/23-0038/23-0038.pdf
2024Lorenzo Pacchiardi, Rilwan A. Adewoyin, Peter Dueben, Ritabrata Dutta
Probabilistic forecasting relies on past observations to provide a probability distribution for a future outcome, which is often evaluated against the realization using a scoring rule. Here, we perform probabilistic forecasting with generative neural networks, which parametrize distributions on high-dimensional spaces by transforming draws from a latent variable. Generative networks are typically trained in an adversarial framework. In contrast, we propose to train generative networks to minimize a predictive-sequential (or prequential) scoring rule on a recorded temporal sequence of the phenomenon of interest, which is appealing as it corresponds to the way forecasting systems are routinely evaluated. Adversarial-free minimization is possible for some scoring rules; hence, our framework avoids the cumbersome hyperparameter tuning and uncertainty underestimation due to unstable adversarial training, thus unlocking reliable use of generative networks in probabilistic forecasting. Further, we prove consistency of the minimizer of our objective with dependent data, while adversarial training assumes independence. We perform simulation studies on two chaotic dynamical models and a benchmark data set of global weather observations; for this last example, we define scoring rules for spatial data by drawing from the relevant literature. Our method outperforms state-of-the-art adversarial approaches, especially in probabilistic calibration, while requiring less hyperparameter tuning.
Multiple Descent in the Multiple Random Feature Model
http://jmlr.org/papers/v25/22-1389.html
http://jmlr.org/papers/volume25/22-1389/22-1389.pdf
2024Xuran Meng, Jianfeng Yao, Yuan Cao
Recent works have demonstrated a double descent phenomenon in over-parameterized learning. Although this phenomenon has been widely investigated, it has not been fully understood in theory. In this paper, we investigate the multiple descent phenomenon in a class of multi-component prediction models. We first consider a "double random feature model" (DRFM) concatenating two types of random features, and study the excess risk achieved by the DRFM in ridge regression. We calculate the precise limit of the excess risk under the high dimensional framework where the training sample size, the dimension of data, and the dimension of random features tend to infinity proportionally. Based on the calculation, we further theoretically demonstrate that the risk curves of DRFMs can exhibit triple descent. We then provide a thorough experimental study to verify our theory. Finally, we extend our study to the "multiple random feature model" (MRFM), and show that MRFMs ensembling $K$ types of random features may exhibit $(K+1)$-fold descent. Our analysis points out that risk curves with a specific number of descents generally exist in learning multi-component prediction models.
Mean-Square Analysis of Discretized Itô Diffusions for Heavy-tailed Sampling
http://jmlr.org/papers/v25/22-1198.html
http://jmlr.org/papers/volume25/22-1198/22-1198.pdf
2024Ye He, Tyler Farghly, Krishnakumar Balasubramanian, Murat A. Erdogdu
We analyze the complexity of sampling from a class of heavy-tailed distributions by discretizing a natural class of Itô diffusions associated with weighted Poincaré inequalities. Based on a mean-square analysis, we establish the iteration complexity for obtaining a sample whose distribution is $\epsilon$ close to the target distribution in the Wasserstein-2 metric. In this paper, our results take the mean-square analysis to its limits, i.e., we invariably only require that the target density has finite variance, the minimal requirement for a mean-square analysis. To obtain explicit estimates, we compute upper bounds on certain moments associated with heavy-tailed targets under various assumptions. We also provide similar iteration complexity results for the case where only function evaluations of the unnormalized target density are available by estimating the gradients using a Gaussian smoothing technique. We provide illustrative examples based on the multivariate $t$-distribution.
Invariant and Equivariant Reynolds Networks
http://jmlr.org/papers/v25/22-0891.html
http://jmlr.org/papers/volume25/22-0891/22-0891.pdf
2024Akiyoshi Sannai, Makoto Kawano, Wataru Kumagai
Various data exhibit symmetry, including permutations in graphs and point clouds. Machine learning methods that utilize this symmetry have achieved considerable success. In this study, we explore learning models for data exhibiting group symmetry. Our focus is on transforming deep neural networks using Reynolds operators, which average over the group to convert a function into an invariant or equivariant form. While learning methods based on Reynolds operators are well-established, they often face computational complexity challenges. To address this, we introduce two new methods that reduce the computational burden associated with the Reynolds operator: (i) Although the Reynolds operator traditionally averages over the entire group, we demonstrate that it can be effectively approximated by averaging over specific subsets of the group, termed the Reynolds design. (ii) We reveal that the pre-model does not require all input variables. Instead, using a select number of partial inputs (Reynolds dimension) is sufficient to achieve a universally applicable model. Employing these methods, which hinge on the Reynolds design and Reynolds dimension concepts, allows us to construct universally applicable models with manageable computational complexity. Our experiments on benchmark data indicate that our approach is more efficient than existing methods.
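As a minimal illustration of the Reynolds operator described in the abstract above, the sketch below averages a function over the full permutation group to obtain a permutation-invariant function. It shows only the classical full-group average, not the reduced "Reynolds design" or "Reynolds dimension" constructions of the paper; the names `reynolds_invariant` and `f` are hypothetical.

```python
from itertools import permutations


def reynolds_invariant(f, n):
    """Return the average of f over all permutations of its n inputs.

    The cost grows as n!, which is the computational burden that
    motivates averaging over smaller subsets of the group.
    """
    perms = list(permutations(range(n)))

    def averaged(x):
        total = sum(f([x[i] for i in p]) for p in perms)
        return total / len(perms)

    return averaged


def f(x):
    # A function that is NOT permutation invariant.
    return x[0] + 2 * x[1] * x[2]


g = reynolds_invariant(f, 3)
```

Calling `g` on any reordering of the same three inputs returns the same value, whereas `f` generally does not.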
Personalized PCA: Decoupling Shared and Unique Features
http://jmlr.org/papers/v25/22-0810.html
http://jmlr.org/papers/volume25/22-0810/22-0810.pdf
2024Naichen Shi, Raed Al Kontar
In this paper, we tackle a significant challenge in PCA: heterogeneity. When data are collected from different sources with heterogeneous trends while still sharing some congruency, it is critical to extract shared knowledge while retaining the unique features of each source. To this end, we propose personalized PCA (PerPCA), which uses mutually orthogonal global and local principal components to encode both unique and shared features. We show that, under mild conditions, both unique and shared features can be identified and recovered by a constrained optimization problem, even if the covariance matrices are immensely different. Also, we design a fully federated algorithm inspired by distributed Stiefel gradient descent to solve the problem. The algorithm introduces a new group of operations called generalized retractions to handle orthogonality constraints, and only requires global PCs to be shared across sources. We prove the linear convergence of the algorithm under suitable assumptions. Comprehensive numerical experiments highlight PerPCA's superior performance in feature extraction and prediction from heterogeneous datasets. As a systematic approach to decouple shared and unique features from heterogeneous datasets, PerPCA finds applications in several tasks, including video segmentation, topic extraction, and feature clustering.
Survival Kernets: Scalable and Interpretable Deep Kernel Survival Analysis with an Accuracy Guarantee
http://jmlr.org/papers/v25/22-0667.html
http://jmlr.org/papers/volume25/22-0667/22-0667.pdf
2024George H. Chen
Kernel survival analysis models estimate individual survival distributions with the help of a kernel function, which measures the similarity between any two data points. Such a kernel function can be learned using deep kernel survival models. In this paper, we present a new deep kernel survival model called a survival kernet, which scales to large datasets in a manner that is amenable to model interpretation and also theoretical analysis. Specifically, the training data are partitioned into clusters based on a recently developed training set compression scheme for classification and regression called kernel netting that we extend to the survival analysis setting. At test time, each data point is represented as a weighted combination of these clusters, and each such cluster can be visualized. For a special case of survival kernets, we establish a finite-sample error bound on predicted survival distributions that is, up to a log factor, optimal. Whereas scalability at test time is achieved using the aforementioned kernel netting compression strategy, scalability during training is achieved by a warm-start procedure based on tree ensembles such as XGBoost and a heuristic approach to accelerating neural architecture search. On four standard survival analysis datasets of varying sizes (up to roughly 3 million data points), we show that survival kernets are highly competitive compared to various baselines tested in terms of time-dependent concordance index. Our code is available at: https://github.com/georgehc/survival-kernets
On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control
http://jmlr.org/papers/v25/21-1343.html
http://jmlr.org/papers/volume25/21-1343/21-1343.pdf
2024Amrit Singh Bedi, Anjaly Parayil, Junyu Zhang, Mengdi Wang, Alec Koppel
Reinforcement learning is a framework for interactive decision-making with incentives sequentially revealed across time without a system dynamics model. Due to its scaling to continuous spaces, we focus on policy search where one iteratively improves a parameterized policy with stochastic policy gradient (PG) updates. In tabular Markov Decision Problems (MDPs), under persistent exploration and suitable parameterization, global optimality may be obtained. By contrast, in continuous space, the non-convexity poses a pathological challenge as evidenced by existing convergence results being mostly limited to stationarity or arbitrary local extrema. To close this gap, we step towards persistent exploration in continuous space through policy parameterizations based on heavier-tailed distributions governed by a tail-index parameter $\alpha$, which increases the likelihood of jumping in state space. Doing so invalidates smoothness conditions of the score function common to PG. Thus, we establish how the convergence rate to stationarity depends on the policy's tail index $\alpha$, a Hölder continuity parameter, integrability conditions, and an exploration tolerance parameter introduced here for the first time. Further, we characterize the dependence of the set of local maxima on the tail index through an exit and transition time analysis of a suitably defined Markov chain, identifying that policies associated with Lévy Processes of a heavier tail converge to wider peaks. This phenomenon yields improved stability to perturbations in supervised learning, and we corroborate that it also manifests as improved performance of policy search, especially when myopic and farsighted incentives are misaligned.
Convergence for nonconvex ADMM, with applications to CT imaging
http://jmlr.org/papers/v25/21-0831.html
http://jmlr.org/papers/volume25/21-0831/21-0831.pdf
2024Rina Foygel Barber, Emil Y. Sidky
The alternating direction method of multipliers (ADMM) algorithm is a powerful and flexible tool for complex optimization problems of the form $\min\{f(x)+g(y) : Ax+By=c\}$. ADMM exhibits robust empirical performance across a range of challenging settings including nonsmoothness and nonconvexity of the objective functions $f$ and $g$, and provides a simple and natural approach to the inverse problem of image reconstruction for computed tomography (CT) imaging. From the theoretical point of view, existing results for convergence in the nonconvex setting generally assume smoothness in at least one of the component functions in the objective. In this work, our new theoretical results provide convergence guarantees under a restricted strong convexity assumption without requiring smoothness or differentiability, while still allowing differentiable terms to be treated approximately if needed. We validate these theoretical results empirically, with a simulated example where both $f$ and $g$ are nondifferentiable---and thus outside the scope of existing theory---as well as a simulated CT image reconstruction problem.
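For readers unfamiliar with the ADMM iteration for $\min\{f(x)+g(y) : Ax+By=c\}$, here is a minimal sketch of the standard scaled-form updates on a scalar toy problem. It is a textbook convex instance, not the nonconvex setting or the CT application analyzed in the paper. The choices $f(x)=\tfrac12(x-a)^2$, $g(y)=\lambda|y|$, $A=1$, $B=-1$, $c=0$ and the names `admm_scalar` and `soft_threshold` are assumptions made for illustration.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.| (soft-thresholding)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def admm_scalar(a, lam, rho=1.0, iters=200):
    """Scaled-form ADMM for
        min 0.5 * (x - a)**2 + lam * |y|   subject to   x - y = 0.
    """
    x = y = u = 0.0  # u is the scaled dual variable
    for _ in range(iters):
        # x-update: minimize 0.5*(x - a)^2 + (rho/2)*(x - y + u)^2
        x = (a + rho * (y - u)) / (1.0 + rho)
        # y-update: proximal step on lam*|y| with parameter lam/rho
        y = soft_threshold(x + u, lam / rho)
        # dual update: accumulate the constraint residual x - y
        u = u + x - y
    return x, y


x, y = admm_scalar(a=3.0, lam=1.0)
```

For this toy problem the minimizer is the soft-thresholded value of `a`, so the iterates can be checked against a closed-form answer.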
Distributed Gaussian Mean Estimation under Communication Constraints: Optimal Rates and Communication-Efficient Algorithms
http://jmlr.org/papers/v25/21-0316.html
http://jmlr.org/papers/volume25/21-0316/21-0316.pdf
2024T. Tony Cai, Hongji Wei
Distributed estimation of a Gaussian mean under communication constraints is studied in a decision theoretical framework. Minimax rates of convergence, which characterize the tradeoff between communication costs and statistical accuracy, are established under the independent protocols. Communication-efficient and statistically optimal procedures are developed. In the univariate case, the optimal rate depends only on the total communication budget, so long as each local machine has at least one bit. However, in the multivariate case, the minimax rate depends on the specific allocations of the communication budgets among the local machines. Although optimal estimation of a Gaussian mean is relatively simple in the conventional setting, it is quite involved under communication constraints, both in terms of the optimal procedure design and the lower bound argument. An essential step is the decomposition of the minimax estimation problem into two stages, localization and refinement. This critical decomposition provides a framework for both the lower bound analysis and optimal procedure design. The optimality results and techniques developed in the present paper can be useful for solving other problems such as distributed nonparametric function estimation and sparse signal recovery.
Sparse NMF with Archetypal Regularization: Computational and Robustness Properties
http://jmlr.org/papers/v25/21-0233.html
http://jmlr.org/papers/volume25/21-0233/21-0233.pdf
2024Kayhan Behdin, Rahul Mazumder
We consider the problem of sparse nonnegative matrix factorization (NMF) using archetypal regularization. The goal is to represent a collection of data points as nonnegative linear combinations of a few nonnegative sparse factors with appealing geometric properties, arising from the use of archetypal regularization. We generalize the notion of robustness studied in Javadi and Montanari (2019) (without sparsity) to the notions of (a) strong robustness that implies each estimated archetype is close to the underlying archetypes and (b) weak robustness that implies there exists at least one recovered archetype that is close to the underlying archetypes. Our theoretical results on robustness guarantees hold under minimal assumptions on the underlying data, and apply to settings where the underlying archetypes need not be sparse. We present theoretical results and illustrative examples to strengthen the insights underlying the notions of robustness. We propose new algorithms for our optimization problem, and present numerical experiments on synthetic and real data sets that shed further light on our proposed framework and theoretical developments.
Deep Network Approximation: Beyond ReLU to Diverse Activation Functions
http://jmlr.org/papers/v25/23-0912.html
http://jmlr.org/papers/volume25/23-0912/23-0912.pdf
2024Shijun Zhang, Jianfeng Lu, Hongkai Zhao
This paper explores the expressive power of deep neural networks for a diverse range of activation functions. An activation function set $\mathscr{A}$ is defined to encompass the majority of commonly used activation functions, such as $\mathtt{ReLU}$, $\mathtt{LeakyReLU}$, $\mathtt{ReLU}^2$, $\mathtt{ELU}$, $\mathtt{CELU}$, $\mathtt{SELU}$, $\mathtt{Softplus}$, $\mathtt{GELU}$, $\mathtt{SiLU}$, $\mathtt{Swish}$, $\mathtt{Mish}$, $\mathtt{Sigmoid}$, $\mathtt{Tanh}$, $\mathtt{Arctan}$, $\mathtt{Softsign}$, $\mathtt{dSiLU}$, and $\mathtt{SRS}$. We demonstrate that for any activation function $\varrho\in \mathscr{A}$, a $\mathtt{ReLU}$ network of width $N$ and depth $L$ can be approximated to arbitrary precision by a $\varrho$-activated network of width $3N$ and depth $2L$ on any bounded set. This finding enables the extension of most approximation results achieved with $\mathtt{ReLU}$ networks to a wide variety of other activation functions, albeit with slightly increased constants. Significantly, we establish that the (width,$\,$depth) scaling factors can be further reduced from $(3,2)$ to $(1,1)$ if $\varrho$ falls within a specific subset of $\mathscr{A}$. This subset includes activation functions such as $\mathtt{ELU}$, $\mathtt{CELU}$, $\mathtt{SELU}$, $\mathtt{Softplus}$, $\mathtt{GELU}$, $\mathtt{SiLU}$, $\mathtt{Swish}$, and $\mathtt{Mish}$.
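A single-neuron illustration may help make the $(1,1)$ scaling claim concrete for one activation. The sketch below uses the well-known fact that $\mathrm{Softplus}(kx)/k$ approaches $\mathrm{ReLU}(x)$ uniformly as $k$ grows, with gap at most $\ln 2 / k$, so rescaling the weights of a Softplus neuron lets it mimic a ReLU neuron without adding width or depth. This is only an elementary scaling argument and is not claimed to be the construction used in the paper; the helper names are hypothetical.

```python
import math


def relu(x):
    return max(x, 0.0)


def softplus(z):
    # Numerically stable form of log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def scaled_softplus(x, k):
    """Softplus neuron with input weight k and output weight 1/k."""
    return softplus(k * x) / k


def max_gap(k, lo=-3.0, hi=3.0, steps=601):
    """Largest |scaled_softplus - relu| on an evenly spaced grid."""
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(abs(scaled_softplus(x, k) - relu(x)) for x in xs)


gaps = {k: max_gap(k) for k in (1, 10, 100)}
```

The measured gap shrinks in proportion to `1 / k`, with the largest deviation occurring at `x = 0`.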
Effect-Invariant Mechanisms for Policy Generalization
http://jmlr.org/papers/v25/23-0802.html
http://jmlr.org/papers/volume25/23-0802/23-0802.pdf
2024Sorawit Saengkyongam, Niklas Pfister, Predrag Klasnja, Susan Murphy, Jonas Peters
Policy learning is an important component of many real-world learning systems. A major challenge in policy learning is how to adapt efficiently to unseen environments or tasks. Recently, it has been suggested to exploit invariant conditional distributions to learn models that generalize better to unseen environments. However, assuming invariance of entire conditional distributions (which we call full invariance) may be too strong of an assumption in practice. In this paper, we introduce a relaxation of full invariance called effect-invariance (e-invariance for short) and prove that it is sufficient, under suitable assumptions, for zero-shot policy generalization. We also discuss an extension that exploits e-invariance when we have a small sample from the test environment, enabling few-shot policy generalization. Our work does not assume an underlying causal graph or that the data are generated by a structural causal model; instead, we develop testing procedures to test e-invariance directly from data. We present empirical results using simulated data and a mobile health intervention dataset to demonstrate the effectiveness of our approach.
Pygmtools: A Python Graph Matching Toolkit
http://jmlr.org/papers/v25/23-0572.html
http://jmlr.org/papers/volume25/23-0572/23-0572.pdf
2024Runzhong Wang, Ziao Guo, Wenzheng Pan, Jiale Ma, Yikai Zhang, Nan Yang, Qi Liu, Longxuan Wei, Hanxue Zhang, Chang Liu, Zetian Jiang, Xiaokang Yang, Junchi Yan
Graph matching aims to find node-to-node matching among multiple graphs, which is a fundamental yet challenging problem. To facilitate graph matching in scientific research and industrial applications, we release pygmtools, a Python graph matching toolkit that implements a comprehensive collection of two-graph matching and multi-graph matching solvers, covering both learning-free solvers and learning-based neural graph matching solvers. Our implementation supports numerical backends including NumPy, PyTorch, Jittor, and Paddle, runs on Windows, macOS, and Linux, and is easy to install and configure. Comprehensive documentation covering a beginner's guide, API reference, and examples is available online. pygmtools is open-sourced under the Mulan PSL v2 license.
Heterogeneous-Agent Reinforcement Learning
http://jmlr.org/papers/v25/23-0488.html
http://jmlr.org/papers/volume25/23-0488/23-0488.pdf
2024Yifan Zhong, Jakub Grudzien Kuba, Xidong Feng, Siyi Hu, Jiaming Ji, Yaodong Yang
The necessity for cooperation among intelligent machines has popularised cooperative multi-agent reinforcement learning (MARL) in AI research. However, many research endeavours heavily rely on parameter sharing among agents, which confines them to the homogeneous-agent setting and leads to training instability and a lack of convergence guarantees. To achieve effective cooperation in the general heterogeneous-agent setting, we propose Heterogeneous-Agent Reinforcement Learning (HARL) algorithms that resolve the aforementioned issues. Central to our findings are the multi-agent advantage decomposition lemma and the sequential update scheme. Based on these, we develop the provably correct Heterogeneous-Agent Trust Region Learning (HATRL), and derive HATRPO and HAPPO by tractable approximations. Furthermore, we discover a novel framework named Heterogeneous-Agent Mirror Learning (HAML), which strengthens theoretical guarantees for HATRPO and HAPPO and provides a general template for cooperative MARL algorithmic designs. We prove that all algorithms derived from HAML inherently enjoy monotonic improvement of joint return and convergence to Nash Equilibrium. As its natural outcome, HAML validates more novel algorithms in addition to HATRPO and HAPPO, including HAA2C, HADDPG, and HATD3, which generally outperform their existing MA-counterparts. We comprehensively test HARL algorithms on six challenging benchmarks and demonstrate their superior effectiveness and stability for coordinating heterogeneous agents compared to strong baselines such as MAPPO and QMIX.
Sample-efficient Adversarial Imitation Learning
http://jmlr.org/papers/v25/23-0314.html
http://jmlr.org/papers/volume25/23-0314/23-0314.pdf
2024Dahuin Jung, Hyungyu Lee, Sungroh Yoon
Imitation learning, in which learning is performed by demonstration, has been studied and advanced for sequential decision-making tasks in which a reward function is not predefined. However, imitation learning methods still require numerous expert demonstration samples to successfully imitate an expert's behavior. To improve sample efficiency, we utilize self-supervised representation learning, which can generate vast training signals from the given data. In this study, we propose a self-supervised representation-based adversarial imitation learning method to learn state and action representations that are robust to diverse distortions and temporally predictive, on non-image control tasks. In particular, in comparison with existing self-supervised learning methods for tabular data, we propose a different corruption method for state and action representations that is robust to diverse distortions. We theoretically and empirically observe that making an informative feature manifold with less sample complexity significantly improves the performance of imitation learning. The proposed method shows a 39% relative improvement over existing adversarial imitation learning methods on MuJoCo in a setting limited to 100 expert state-action pairs. Moreover, we conduct comprehensive ablations and additional experiments using demonstrations with varying optimality to provide insights into a range of factors.
Stochastic Modified Flows, Mean-Field Limits and Dynamics of Stochastic Gradient Descent
http://jmlr.org/papers/v25/23-0220.html
http://jmlr.org/papers/volume25/23-0220/23-0220.pdf
2024Benjamin Gess, Sebastian Kassing, Vitalii Konarovskyi
We propose new limiting dynamics for stochastic gradient descent in the small learning rate regime called stochastic modified flows. These SDEs are driven by a cylindrical Brownian motion and improve the so-called stochastic modified equations by having regular diffusion coefficients and by matching the multi-point statistics. As a second contribution, we introduce distribution dependent stochastic modified flows which we prove to describe the fluctuating limiting dynamics of stochastic gradient descent in the small learning rate - infinite width scaling regime.
Rates of convergence for density estimation with generative adversarial networks
http://jmlr.org/papers/v25/23-0062.html
http://jmlr.org/papers/volume25/23-0062/23-0062.pdf
2024Nikita Puchkin, Sergey Samsonov, Denis Belomestny, Eric Moulines, Alexey Naumov
In this work we undertake a thorough study of the non-asymptotic properties of the vanilla generative adversarial networks (GANs). We prove an oracle inequality for the Jensen-Shannon (JS) divergence between the underlying density $\mathsf{p}^*$ and the GAN estimate with a significantly better statistical error term compared to the previously known results. The advantage of our bound becomes clear in application to nonparametric density estimation. We show that the JS-divergence between the GAN estimate and $\mathsf{p}^*$ decays as fast as $(\log{n}/n)^{2\beta/(2\beta + d)}$, where $n$ is the sample size and $\beta$ determines the smoothness of $\mathsf{p}^*$. This rate of convergence coincides (up to logarithmic factors) with the minimax optimal rate for the considered class of densities.
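To get a feel for how the stated rate depends on the sample size, smoothness, and dimension, the expression can be evaluated directly. This is a small sketch; the function name `js_rate` is an illustrative assumption, and constants hidden by the rate are ignored.

```python
import math


def js_rate(n, beta, d):
    """Evaluate (log n / n) ** (2*beta / (2*beta + d)) for a sample size n > 1."""
    return (math.log(n) / n) ** (2.0 * beta / (2.0 * beta + d))


# The exponent shrinks as the dimension d grows, so the rate slows down;
# a larger smoothness beta pushes the exponent back towards 1.
for d in (1, 10, 100):
    print(d, js_rate(10_000, beta=2.0, d=d))
```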
Additive smoothing error in backward variational inference for general state-space models
http://jmlr.org/papers/v25/22-1392.html
http://jmlr.org/papers/volume25/22-1392/22-1392.pdf
2024Mathis Chagneux, Elisabeth Gassiat, Pierre Gloaguen, Sylvain Le Corff
We consider the problem of state estimation in general state-space models using variational inference. For a generic variational family defined using the same backward decomposition as the actual joint smoothing distribution, we establish under mixing assumptions that the variational approximation of expectations of additive state functionals induces an error which grows at most linearly in the number of observations. This guarantee is consistent with the known upper bounds for the approximation of smoothing distributions using standard Monte Carlo methods. We illustrate our theoretical result with state-of-the-art variational solutions based both on the backward parameterization and on alternatives using forward decompositions. This numerical study proposes guidelines for variational inference based on neural networks in state-space models.
Optimal Bump Functions for Shallow ReLU networks: Weight Decay, Depth Separation, Curse of Dimensionality
http://jmlr.org/papers/v25/22-1296.html
http://jmlr.org/papers/volume25/22-1296/22-1296.pdf
2024Stephan Wojtowytsch
In this note, we study how neural networks with a single hidden layer and ReLU activation interpolate data drawn from a radially symmetric distribution with target labels 1 at the origin and 0 outside the unit ball, if no labels are known inside the unit ball. With weight decay regularization and in the infinite neuron, infinite data limit, we prove that a unique radially symmetric minimizer exists, whose average parameters and Lipschitz constant grow as $d$ and $\sqrt{d}$ respectively. We furthermore show that the average weight variable grows exponentially in $d$ if the label $1$ is imposed on a ball of radius $\varepsilon$ rather than just at the origin. By comparison, a neural network with two hidden layers can approximate the target function without encountering the curse of dimensionality.
Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees
http://jmlr.org/papers/v25/22-1170.html
http://jmlr.org/papers/volume25/22-1170/22-1170.pdf
2024Alexander Terenin, David R. Burt, Artem Artemev, Seth Flaxman, Mark van der Wilk, Carl Edward Rasmussen, Hong Ge
Gaussian processes are frequently deployed as part of larger machine learning and decision-making systems, for instance in geospatial modeling, Bayesian optimization, or in latent Gaussian models. Within a system, the Gaussian process model needs to perform in a stable and reliable manner to ensure it interacts correctly with other parts of the system. In this work, we study the numerical stability of scalable sparse approximations based on inducing points. To do so, we first review numerical stability, and illustrate typical situations in which Gaussian process models can be unstable. Building on stability theory originally developed in the interpolation literature, we derive sufficient and in certain cases necessary conditions on the inducing points for the computations performed to be numerically stable. For low-dimensional tasks such as geospatial modeling, we propose an automated method for computing inducing points satisfying these conditions. This is done via a modification of the cover tree data structure, which is of independent interest. We additionally propose an alternative sparse approximation for regression with a Gaussian likelihood which trades off a small amount of performance to further improve stability. We provide illustrative examples showing the relationship between stability of calculations and predictive performance of inducing point methods on spatial tasks.
On Tail Decay Rate Estimation of Loss Function Distributions
http://jmlr.org/papers/v25/22-0846.html
http://jmlr.org/papers/volume25/22-0846/22-0846.pdf
2024Etrit Haxholli, Marco Lorenzi
The study of loss-function distributions is critical to characterize a model's behaviour on a given machine-learning problem. While model quality is commonly measured by the average loss assessed on a testing set, this quantity does not ascertain the existence of the mean of the loss distribution. Conversely, the existence of a distribution's statistical moments can be verified by examining the thickness of its tails. Cross-validation schemes determine a family of testing loss distributions conditioned on the training sets. By marginalizing across training sets, we can recover the overall (marginal) loss distribution, whose tail-shape we aim to estimate. Small sample-sizes diminish the reliability and efficiency of classical tail-estimation methods like Peaks-Over-Threshold, and we demonstrate that this effect is notably significant when estimating tails of marginal distributions composed of conditional distributions with substantial tail-location variability. We mitigate this problem by utilizing a result we prove: under certain conditions, the marginal-distribution's tail-shape parameter is the maximum tail-shape parameter across the conditional distributions underlying the marginal. We label the resulting approach as 'cross-tail estimation (CTE)'. We test CTE in a series of experiments on simulated and real data showing the improved robustness and quality of tail estimation as compared to classical approaches.
Deep Nonparametric Estimation of Operators between Infinite Dimensional Spaces
http://jmlr.org/papers/v25/22-0719.html
http://jmlr.org/papers/volume25/22-0719/22-0719.pdf
2024Hao Liu, Haizhao Yang, Minshuo Chen, Tuo Zhao, Wenjing Liao
Learning operators between infinite-dimensional spaces is an important learning task arising in machine learning, imaging science, mathematical modeling, and simulation. This paper studies the nonparametric estimation of Lipschitz operators using deep neural networks. Non-asymptotic upper bounds are derived for the generalization error of the empirical risk minimizer over a properly chosen network class. Under the assumption that the target operator exhibits a low dimensional structure, our error bounds decay as the training sample size increases, with an attractive fast rate depending on the intrinsic dimension in our estimation. Our assumptions cover most scenarios in real applications and our results give rise to fast rates by exploiting low dimensional structures of data in operator estimation. We also investigate the influence of network structures (e.g., network width, depth, and sparsity) on the generalization error of the neural network estimator and propose a general suggestion on the choice of network structures to maximize the learning efficiency quantitatively.
Post-Regularization Confidence Bands for Ordinary Differential Equations
http://jmlr.org/papers/v25/22-0487.html
http://jmlr.org/papers/volume25/22-0487/22-0487.pdf
2024Xiaowu Dai, Lexin Li
Ordinary differential equations (ODEs) are an important tool for studying systems of biological and physical processes. A central question in ODE modeling is to infer the significance of the individual regulatory effect of one signal variable on another. However, building a confidence band for an ODE with unknown regulatory relations is challenging, and it remains largely an open question. In this article, we construct the post-regularization confidence band for the individual regulatory function in an ODE with unknown functionals and noisy data observations. Our proposal is the first of its kind, and is built on two novel ingredients. The first is a new localized kernel learning approach that combines reproducing kernel learning with local Taylor approximation, and the second is a new de-biasing method that tackles infinite-dimensional functionals and additional measurement errors. We show that the constructed confidence band has the desired asymptotic coverage probability, and the recovered regulatory network approaches the truth with probability tending to one. We establish the theoretical properties when the number of variables in the system can be either smaller or larger than the number of sampling time points, and we study the regime-switching phenomenon. We demonstrate the efficacy of the proposed method through both simulations and illustrations with two data applications.
On the Generalization of Stochastic Gradient Descent with Momentum
http://jmlr.org/papers/v25/22-0068.html
http://jmlr.org/papers/volume25/22-0068/22-0068.pdf
2024Ali Ramezani-Kebrya, Kimon Antonakopoulos, Volkan Cevher, Ashish Khisti, Ben Liang
While momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models, there is little theoretical understanding on the generalization error of such methods. In this work, we first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded. Then, for smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes, and show that it can train machine learning models for multiple epochs with a guarantee for generalization. Finally, for the special case of strongly convex loss functions, we find a range of momentum such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalizes. Extending our results on generalization, we also develop an upper bound on the expected true risk, in terms of the number of training steps, sample size, and momentum. Our experimental evaluations verify the consistency between the numerical results and our theoretical bounds. SGDEM improves the generalization error of SGDM when training ResNet-18 on ImageNet in practical distributed settings.
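The difference between standard heavy-ball momentum and early momentum can be sketched on a one-dimensional least-squares problem. This sketch assumes that "early momentum" means the momentum term is applied only during the first `t_d` updates and is switched off afterwards; the function name, the parameter name `t_d`, and the toy data are illustrative assumptions rather than the paper's setup.

```python
import random


def sgd_early_momentum(steps=500, t_d=50, lr=0.1, mu=0.5, seed=0):
    """SGD with heavy-ball momentum on y = 3x, using momentum for the first t_d steps only.

    Setting t_d = steps gives standard heavy-ball SGDM; t_d = 0 gives plain SGD.
    """
    rng = random.Random(seed)
    w_prev = w = 0.0
    for t in range(steps):
        x = rng.uniform(0.5, 1.5)       # one noise-free training sample per step
        y = 3.0 * x
        grad = (w * x - y) * x          # gradient of 0.5 * (w * x - y) ** 2 in w
        m = mu if t < t_d else 0.0      # momentum is active only early in training
        w, w_prev = w - lr * grad + m * (w - w_prev), w
    return w


w_early = sgd_early_momentum()              # early momentum
w_heavy = sgd_early_momentum(t_d=500)       # momentum throughout
```

Both variants recover a weight close to 3 on this easy problem; the abstract's claims concern generalization guarantees, which a toy run like this does not demonstrate.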
Pursuit of the Cluster Structure of Network Lasso: Recovery Condition and Non-convex Extension
http://jmlr.org/papers/v25/21-1190.html
http://jmlr.org/papers/volume25/21-1190/21-1190.pdf
2024Shotaro Yagishita, Jun-ya Gotoh
Network lasso (NL for short) is a technique for estimating models by simultaneously clustering data samples and fitting the models to them. It often succeeds in forming clusters thanks to the geometry of the sum of $\ell_2$ norm employed therein, but there may be limitations due to the convexity of the regularizer. This paper focuses on clustering generated by NL and strengthens it by creating a non-convex extension, called network trimmed lasso (NTL for short). Specifically, we initially investigate a sufficient condition that guarantees the recovery of the latent cluster structure of NL on the basis of the result of Sun et al. (2021) for convex clustering, which is a special case of NL for ordinary clustering. Second, we extend NL to NTL to incorporate a cardinality (or, $\ell_0$-)constraint and rewrite the constrained optimization problem defined with the $\ell_0$ norm, a discontinuous function, into an equivalent unconstrained continuous optimization problem. We develop ADMM algorithms to solve NTL and show their convergence results. Numerical illustrations indicate that the non-convex extension provides a more clear-cut cluster structure when NL fails to form clusters without incorporating prior knowledge of the associated parameters.
Iterate Averaging in the Quest for Best Test Error
http://jmlr.org/papers/v25/21-1125.html
http://jmlr.org/papers/volume25/21-1125/21-1125.pdf
2024Diego Granziol, Nicholas P. Baskerville, Xingchen Wan, Samuel Albanie, Stephen Roberts
We analyse and explain the increased generalisation performance of iterate averaging using a Gaussian process perturbation model between the true and batch risk surface on the high dimensional quadratic. We derive three phenomena from our theoretical results: (1) The importance of combining iterate averaging (IA) with large learning rates and regularisation for improved generalisation. (2) Justification for less frequent averaging. (3) That we expect adaptive gradient methods to work equally well, or better, with iterate averaging than their non-adaptive counterparts. Inspired by these results, together with empirical investigations of the importance of appropriate regularisation for the solution diversity of the iterates, we propose two adaptive algorithms with iterate averaging. These give significantly better results compared to stochastic gradient descent (SGD), require less tuning and do not require early stopping or validation set monitoring. We showcase the efficacy of our approach on the CIFAR-10/100, ImageNet and Penn Treebank datasets on a variety of modern and classical network architectures.
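Iterate averaging itself is simple to state in code. The sketch below runs SGD with a large learning rate on a noisy one-dimensional quadratic and keeps a running mean of the iterates after a burn-in period. The names `burn_in` and `avg_every` and the toy loss are illustrative assumptions; the adaptive algorithms proposed in the paper are not reproduced here.

```python
import random


def sgd_with_iterate_averaging(steps=5000, burn_in=1000, avg_every=1,
                               lr=0.5, noise=1.0, seed=0):
    """Return (last_iterate, averaged_iterate) for SGD on 0.5 * (w - 1) ** 2 with noisy gradients."""
    rng = random.Random(seed)
    w = 0.0
    avg, count = 0.0, 0
    for t in range(steps):
        grad = (w - 1.0) + rng.gauss(0.0, noise)   # noisy gradient; the minimiser is w = 1
        w -= lr * grad
        # After the burn-in, fold every avg_every-th iterate into a running mean.
        if t >= burn_in and (t - burn_in) % avg_every == 0:
            count += 1
            avg += (w - avg) / count
    return w, avg


last, averaged = sgd_with_iterate_averaging()
_, averaged_sparse = sgd_with_iterate_averaging(avg_every=10)
```

With a large learning rate the last iterate keeps fluctuating around the minimiser, while the averaged iterate smooths out that noise. Setting `avg_every` above 1 corresponds to averaging less frequently.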
Nonparametric Inference under B-bits Quantization
http://jmlr.org/papers/v25/20-075.html
http://jmlr.org/papers/volume25/20-075/20-075.pdf
2024Kexuan Li, Ruiqi Liu, Ganggang Xu, Zuofeng Shang
Statistical inference based on lossy or incomplete samples is often needed in research areas such as signal/image processing, medical image storage, remote sensing, and signal transmission. In this paper, we propose a nonparametric testing procedure based on samples quantized to $B$ bits through a computationally efficient algorithm. Under mild technical conditions, we establish the asymptotic properties of the proposed test statistic and investigate how the testing power changes as $B$ increases. In particular, we show that if $B$ exceeds a certain threshold, the proposed nonparametric testing procedure achieves the classical minimax rate of testing (Shang and Cheng, 2015) for spline models. We further extend our theoretical investigations to a nonparametric linearity test and an adaptive nonparametric test, expanding the applicability of the proposed methods. Extensive simulation studies together with a real-data analysis are used to demonstrate the validity and effectiveness of the proposed tests.
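As a rough illustration of what quantizing a sample to $B$ bits means, here is a uniform quantizer with $2^B$ cells. The uniform scheme, the range `[lo, hi]`, and the function name `quantize` are assumptions made for this sketch; the quantization algorithm used in the paper may differ.

```python
def quantize(x, bits, lo=0.0, hi=1.0):
    """Map x to the midpoint of one of 2**bits equal-width cells covering [lo, hi]."""
    levels = 2 ** bits
    x = min(max(x, lo), hi)                         # clip to the representable range
    width = (hi - lo) / levels
    idx = min(int((x - lo) / width), levels - 1)    # cell index; hi falls in the last cell
    return lo + (idx + 0.5) * width


coarse = quantize(0.3, bits=1)   # two cells, midpoints 0.25 and 0.75
finer = quantize(0.3, bits=2)    # four cells, midpoints 0.125, 0.375, ...
```

For inputs inside the range, the quantization error is at most half a cell width, $(hi - lo)/2^{B+1}$, so each extra bit halves the worst-case error.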
Black Box Variational Inference with a Deterministic Objective: Faster, More Accurate, and Even More Black Box
http://jmlr.org/papers/v25/23-1015.html
http://jmlr.org/papers/volume25/23-1015/23-1015.pdf
2024Ryan Giordano, Martin Ingram, Tamara Broderick
Automatic differentiation variational inference (ADVI) offers fast and easy-to-use posterior approximation in multiple modern probabilistic programming languages. However, its stochastic optimizer lacks clear convergence criteria and requires tuning parameters. Moreover, ADVI inherits the poor posterior uncertainty estimates of mean-field variational Bayes (MFVB). We introduce "deterministic ADVI" (DADVI) to address these issues. DADVI replaces the intractable MFVB objective with a fixed Monte Carlo approximation, a technique known in the stochastic optimization literature as the "sample average approximation" (SAA). By optimizing an approximate but deterministic objective, DADVI can use off-the-shelf second-order optimization, and, unlike standard mean-field ADVI, is amenable to more accurate posterior covariances via linear response (LR). In contrast to existing worst-case theory, we show that, on certain classes of common statistical problems, DADVI and the SAA can perform well with relatively few samples even in very high dimensions, though we also show that such favorable results cannot extend to variational approximations that are too expressive relative to mean-field ADVI. We show on a variety of real-world problems that DADVI reliably finds good solutions with default settings (unlike ADVI) and, together with LR covariances, is typically faster and more accurate than standard ADVI.
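The sample average approximation can be sketched in a few lines: draw the standard normal variates once, then reuse them for every evaluation so that the objective becomes an ordinary deterministic function of the variational parameters. This is a one-dimensional sketch with a Gaussian variational family; the helper name `make_saa_elbo` and the toy target are illustrative assumptions, not the DADVI implementation.

```python
import math
import random


def make_saa_elbo(log_p, num_draws=30, seed=0):
    """Build a deterministic ELBO estimate for q = N(mu, exp(log_sigma) ** 2) from fixed draws."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(num_draws)]   # drawn once, reused on every call

    def elbo(mu, log_sigma):
        sigma = math.exp(log_sigma)
        expected_log_p = sum(log_p(mu + sigma * zi) for zi in z) / len(z)
        return expected_log_p + log_sigma                 # Gaussian entropy up to a constant

    return elbo


def log_target(theta):
    """Unnormalised log density of a N(2, 0.5 ** 2) target."""
    return -0.5 * ((theta - 2.0) / 0.5) ** 2


elbo = make_saa_elbo(log_target)
value = elbo(2.0, math.log(0.5))
```

Because the draws are fixed, repeated evaluations at the same parameters return the same value, which is what allows an off-the-shelf deterministic optimizer to be applied.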
On Sufficient Graphical Models
http://jmlr.org/papers/v25/23-0893.html
http://jmlr.org/papers/volume25/23-0893/23-0893.pdf
2024Bing Li, Kyongwon Kim
We introduce a sufficient graphical model by applying the recently developed nonlinear sufficient dimension reduction techniques to the evaluation of conditional independence. The graphical model is nonparametric in nature, as it does not make distributional assumptions such as the Gaussian or copula Gaussian assumptions. However, unlike a fully nonparametric graphical model, which relies on the high-dimensional kernel to characterize conditional independence, our graphical model is based on conditional independence given a set of sufficient predictors with a substantially reduced dimension. In this way we avoid the curse of dimensionality that comes with a high-dimensional kernel. We develop the population-level properties, convergence rate, and variable selection consistency of our estimate. By simulation comparisons and an analysis of the DREAM 4 Challenge data set, we demonstrate that our method outperforms the existing methods when the Gaussian or copula Gaussian assumptions are violated, and its performance remains excellent in the high-dimensional setting.
Localized Debiased Machine Learning: Efficient Inference on Quantile Treatment Effects and Beyond
http://jmlr.org/papers/v25/23-0661.html
http://jmlr.org/papers/volume25/23-0661/23-0661.pdf
2024Nathan Kallus, Xiaojie Mao, Masatoshi Uehara
We consider estimating a low-dimensional parameter in an estimating equation involving high-dimensional nuisance functions that depend on the target parameter as an input. A central example is the efficient estimating equation for the (local) quantile treatment effect ((L)QTE) in causal inference, which involves the covariate-conditional cumulative distribution function evaluated at the quantile to be estimated. Existing approaches based on flexibly estimating the nuisances and plugging in the estimates, such as debiased machine learning (DML), require we learn the nuisance at all possible inputs. For (L)QTE, DML requires we learn the whole covariate-conditional cumulative distribution function. We instead propose localized debiased machine learning (LDML), which avoids this burdensome step and needs only estimate nuisances at a single initial rough guess for the target parameter. For (L)QTE, LDML involves learning just two regression functions, a standard task for machine learning methods. We prove that under lax rate conditions our estimator has the same favorable asymptotic behavior as the infeasible estimator that uses the unknown true nuisances. Thus, LDML notably enables practically-feasible and theoretically-grounded efficient estimation of important quantities in causal inference such as (L)QTEs when we must control for many covariates and/or flexible relationships, as we demonstrate in empirical studies.
On the Effect of Initialization: The Scaling Path of 2-Layer Neural Networks
http://jmlr.org/papers/v25/23-0549.html
http://jmlr.org/papers/volume25/23-0549/23-0549.pdf
2024Sebastian Neumayer, Lénaïc Chizat, Michael Unser
In supervised learning, the regularization path is sometimes used as a convenient theoretical proxy for the optimization path of gradient descent initialized from zero. In this paper, we study a modification of the regularization path for infinite-width 2-layer ReLU neural networks with nonzero initial distribution of the weights at different scales. By exploiting a link with unbalanced optimal-transport theory, we show that, despite the non-convexity of the 2-layer network training, this problem admits an infinite-dimensional convex counterpart. We formulate the corresponding functional-optimization problem and investigate its main properties. In particular, we show that, as the scale of the initialization ranges between $0$ and $+\infty$, the associated path interpolates continuously between the so-called kernel and rich regimes. Numerical experiments confirm that, in our setting, the scaling path and the final states of the optimization path behave similarly, even beyond these extreme points.
Improving physics-informed neural networks with meta-learned optimization
http://jmlr.org/papers/v25/23-0356.html
http://jmlr.org/papers/volume25/23-0356/23-0356.pdf
2024Alex Bihlo
We show that the error achievable using physics-informed neural networks for solving differential equations can be substantially reduced when these networks are trained using meta-learned optimization methods rather than using fixed, hand-crafted optimizers as traditionally done. We choose a learnable optimization method based on a shallow multi-layer perceptron that is meta-trained for specific classes of differential equations. We illustrate meta-trained optimizers for several equations of practical relevance in mathematical physics, including the linear advection equation, Poisson's equation, the Korteweg-de Vries equation and Burgers' equation. We also illustrate that meta-learned optimizers exhibit transfer learning abilities, in that a meta-trained optimizer on one differential equation can also be successfully deployed on another differential equation.
A Comparison of Continuous-Time Approximations to Stochastic Gradient Descent
http://jmlr.org/papers/v25/23-0237.html
http://jmlr.org/papers/volume25/23-0237/23-0237.pdf
2024Stefan Ankirchner, Stefan Perko
Applying a stochastic gradient descent (SGD) method for minimizing an objective gives rise to a discrete-time process of estimated parameter values. In order to better understand the dynamics of the estimated values, many authors have considered continuous-time approximations of SGD. We refine existing results on the weak error of first-order ODE and SDE approximations to SGD for non-infinitesimal learning rates. In particular, we explicitly compute the linear term in the error expansion of gradient flow and two of its stochastic counterparts, with respect to a discretization parameter $h$. In the example of linear regression, we demonstrate the general inferiority of the deterministic gradient flow approximation in comparison to the stochastic ones, for batch sizes which are not too large. Further, we demonstrate that for Gaussian features an SDE approximation with state-independent noise (CC) is preferred over using a state-dependent coefficient (NCC). The same comparison holds true for features of low kurtosis or large batch sizes. However, the relationship reverses for highly leptokurtic features or small batch sizes.
Critically Assessing the State of the Art in Neural Network Verification
http://jmlr.org/papers/v25/23-0119.html
http://jmlr.org/papers/volume25/23-0119/23-0119.pdf
2024Matthias König, Annelot W. Bosman, Holger H. Hoos, Jan N. van Rijn
Recent research has proposed various methods to formally verify neural networks against minimal input perturbations; this verification task is also known as local robustness verification. The research area of local robustness verification is highly diverse, as verifiers rely on a multitude of techniques, including mixed integer programming and satisfiability modulo theories. At the same time, the problem instances encountered when performing local robustness verification differ based on the network to be verified, the property to be verified and the specific network input. This raises the question of which verification algorithm is most suitable for solving specific types of instances of the local robustness verification problem. To answer this question, we performed a systematic performance analysis of several CPU- and GPU-based local robustness verification systems on a newly and carefully assembled set of 79 neural networks, of which we verified a broad range of robustness properties, while taking a practitioner's point of view -- a perspective that complements the insights from initiatives such as the VNN competition, where the participating tools are carefully adapted to the given benchmarks by their developers. Notably, we show that no single best algorithm dominates performance across all verification problem instances. Instead, our results reveal complementarities in verifier performance and illustrate the potential of leveraging algorithm portfolios for more efficient local robustness verification. We quantify this complementarity using various performance measures, such as the Shapley value. Furthermore, we confirm the notion that most algorithms only support ReLU-based networks, while other activation functions remain under-supported.
Estimating the Minimizer and the Minimum Value of a Regression Function under Passive Design
http://jmlr.org/papers/v25/22-1396.html
http://jmlr.org/papers/volume25/22-1396/22-1396.pdf
2024Arya Akhavan, Davit Gogolashvili, Alexandre B. Tsybakov
We propose a new method for estimating the minimizer $\boldsymbol{x}^*$ and the minimum value $f^*$ of a smooth and strongly convex regression function $f$ from the observations contaminated by random noise. Our estimator $\boldsymbol{z}_n$ of the minimizer $\boldsymbol{x}^*$ is based on a version of the projected gradient descent with the gradient estimated by a regularized local polynomial algorithm. Next, we propose a two-stage procedure for estimation of the minimum value $f^*$ of regression function $f$. At the first stage, we construct an accurate enough estimator of $\boldsymbol{x}^*$, which can be, for example, $\boldsymbol{z}_n$. At the second stage, we estimate the function value at the point obtained in the first stage using a rate optimal nonparametric procedure. We derive non-asymptotic upper bounds for the quadratic risk and optimization risk of $\boldsymbol{z}_n$, and for the risk of estimating $f^*$. We establish minimax lower bounds showing that, under certain choice of parameters, the proposed algorithms achieve the minimax optimal rates of convergence on the class of smooth and strongly convex functions.
Modeling Random Networks with Heterogeneous Reciprocity
http://jmlr.org/papers/v25/22-1317.html
http://jmlr.org/papers/volume25/22-1317/22-1317.pdf
2024Daniel Cirkovic, Tiandong Wang
Reciprocity, or the tendency of individuals to mirror behavior, is a key measure that describes information exchange in a social network. Users in social networks tend to engage in different levels of reciprocal behavior. Differences in such behavior may indicate the existence of communities that reciprocate links at varying rates. In this paper, we develop methodology to model the diverse reciprocal behavior in growing social networks. In particular, we present a preferential attachment model with heterogeneous reciprocity that imitates the attraction users have for popular users, plus the heterogeneous nature by which they reciprocate links. We compare Bayesian and frequentist model fitting techniques for large networks, as well as computationally efficient variational alternatives. Cases where the number of communities is known and unknown are both considered. We apply the presented methods to the analysis of Facebook and Reddit networks where users have non-uniform reciprocal behavior patterns. The fitted model captures the heavy-tailed nature of the empirical degree distributions in the datasets and identifies multiple groups of users that differ in their tendency to reply to and receive responses to wallposts and comments.
Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment
http://jmlr.org/papers/v25/22-1251.html
http://jmlr.org/papers/volume25/22-1251/22-1251.pdf
2024Zixian Yang, Xin Liu, Lei Ying
The traditional multi-armed bandit (MAB) model for recommendation systems assumes the user stays in the system for the entire learning horizon. In new online education platforms such as ALEKS or new video recommendation systems such as TikTok, the amount of time a user spends on the app depends on how engaging the recommended contents are. Users may temporarily leave the system if the recommended items cannot engage the users. To understand the exploration, exploitation, and engagement in these systems, we propose a new model, called MAB-A where “A” stands for abandonment and the abandonment probability depends on the current recommended item and the user's past experience (called state). We propose two algorithms, ULCB and KL-ULCB, both of which do more exploration (being optimistic) when the user likes the previous recommended item and less exploration (being pessimistic) when the user does not. We prove that both ULCB and KL-ULCB achieve logarithmic regret, $O(\log K)$, where $K$ is the number of visits (or episodes). Furthermore, the regret bound under KL-ULCB is asymptotically sharp. We also extend the proposed algorithms to the general-state setting. Simulation results show that the proposed algorithms have significantly lower regret than the traditional UCB and KL-UCB, and Q-learning-based algorithms.
On Efficient and Scalable Computation of the Nonparametric Maximum Likelihood Estimator in Mixture Models
http://jmlr.org/papers/v25/22-1120.html
http://jmlr.org/papers/volume25/22-1120/22-1120.pdf
2024Yangjing Zhang, Ying Cui, Bodhisattva Sen, Kim-Chuan Toh
In this paper, we focus on the computation of the nonparametric maximum likelihood estimator (NPMLE) in multivariate mixture models. Our approach discretizes this infinite dimensional convex optimization problem by setting fixed support points for the NPMLE and optimizing over the mixing proportions. We propose an efficient and scalable semismooth Newton based augmented Lagrangian method (ALM). Our algorithm outperforms the state-of-the-art methods (Kim et al., 2020; Koenker and Gu, 2017) and is capable of handling $n \approx 10^6$ data points with $m \approx 10^4$ support points. A key advantage of our approach is its strategic utilization of the solution's sparsity, leading to structured sparsity in Hessian computations. As a result, our algorithm demonstrates better scaling in terms of $m$ when compared to the mixsqp method (Kim et al., 2020). The computed NPMLE can be directly applied to denoising the observations in the framework of empirical Bayes. We propose new denoising estimands in this context along with their consistent estimates. Extensive numerical experiments are conducted to illustrate the efficiency of our ALM. In particular, we employ our method to analyze two astronomy data sets: (i) Gaia-TGAS Catalog (Anderson et al., 2018) containing approximately $1.4 \times 10^6$ data points in two dimensions, and (ii) a data set from the APOGEE survey (Majewski et al., 2017) with approximately $2.7 \times 10^4$ data points.
Decorrelated Variable Importance
http://jmlr.org/papers/v25/22-0801.html
http://jmlr.org/papers/volume25/22-0801/22-0801.pdf
2024Isabella Verdinelli, Larry Wasserman
Because of the widespread use of black box prediction methods such as random forests and neural nets, there is renewed interest in developing methods for quantifying variable importance as part of the broader goal of interpretable prediction. A popular approach is to define a variable importance parameter --- known as LOCO (Leave Out COvariates) --- based on dropping covariates from a regression model. This is essentially a nonparametric version of $R^2$. This parameter is very general and can be estimated nonparametrically, but it can be hard to interpret because it is affected by correlation between covariates. We propose a method for mitigating the effect of correlation by defining a modified version of LOCO. This new parameter is difficult to estimate nonparametrically, but we show how to estimate it using semiparametric models.
Model-Free Representation Learning and Exploration in Low-Rank MDPs
http://jmlr.org/papers/v25/22-0687.html
http://jmlr.org/papers/volume25/22-0687/22-0687.pdf
2024Aditya Modi, Jinglin Chen, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal
The low-rank MDP has emerged as an important model for studying representation learning and exploration in reinforcement learning. With a known representation, several model-free exploration strategies exist. In contrast, all algorithms for the unknown representation setting are model-based, thereby requiring the ability to model the full dynamics. In this work, we present the first model-free representation learning algorithms for low-rank MDPs. The key algorithmic contribution is a new minimax representation learning objective, for which we provide variants with differing tradeoffs in their statistical and computational properties. We interleave this representation learning step with an exploration strategy to cover the state space in a reward-free manner. The resulting algorithms are provably sample efficient and can accommodate general function approximation to scale to complex environments.
Seeded Graph Matching for the Correlated Gaussian Wigner Model via the Projected Power Method
http://jmlr.org/papers/v25/22-0402.html
http://jmlr.org/papers/volume25/22-0402/22-0402.pdf
2024Ernesto Araya, Guillaume Braun, Hemant Tyagi
In the graph matching problem we observe two graphs $G,H$ and the goal is to find an assignment (or matching) between their vertices such that some measure of edge agreement is maximized. We assume in this work that the observed pair $G,H$ has been drawn from the Correlated Gaussian Wigner (CGW) model -- a popular model for correlated weighted graphs -- where the entries of the adjacency matrices of $G$ and $H$ are independent Gaussians and each edge of $G$ is correlated with one edge of $H$ (determined by the unknown matching) with the edge correlation described by a parameter $\sigma \in [0,1)$. In this paper, we analyse the performance of the projected power method (PPM) as a seeded graph matching algorithm where we are given an initial partially correct matching (called the seed) as side information. We prove that if the seed is close enough to the ground-truth matching, then with high probability, PPM iteratively improves the seed and recovers the ground-truth matching (either partially or exactly) in $O(\log n)$ iterations. Our results prove that PPM works even in regimes of constant $\sigma$, thus extending the analysis in (Mao et al., 2023) for the sparse Correlated Erdős-Rényi (CER) model to the (dense) CGW model. As a byproduct of our analysis, we see that the PPM framework generalizes some of the state-of-the-art algorithms for seeded graph matching. We support and complement our theoretical findings with numerical experiments on synthetic data.
Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization
http://jmlr.org/papers/v25/21-1205.html
http://jmlr.org/papers/volume25/21-1205/21-1205.pdf
2024Shicong Cen, Yuting Wei, Yuejie Chi
This paper investigates the problem of computing the equilibrium of competitive games in the form of two-player zero-sum games, which is often modeled as a constrained saddle-point optimization problem with probability simplex constraints. Despite recent efforts in understanding the last-iterate convergence of extragradient methods in the unconstrained setting, the theoretical underpinnings of these methods in the constrained settings, especially those using multiplicative updates, remain highly inadequate, even when the objective function is bilinear. Motivated by the algorithmic role of entropy regularization in single-agent reinforcement learning and game theory, we develop provably efficient extragradient methods to find the quantal response equilibrium (QRE)---which are solutions to zero-sum two-player matrix games with entropy regularization---at a linear rate. The proposed algorithms can be implemented in a decentralized manner, where each player executes symmetric and multiplicative updates iteratively using its own payoff without observing the opponent's actions directly. In addition, by controlling the knob of entropy regularization, the proposed algorithms can locate an approximate Nash equilibrium of the unregularized matrix game at a sublinear rate without assuming the Nash equilibrium to be unique. Our methods also lead to efficient policy extragradient algorithms for solving (entropy-regularized) zero-sum Markov games at similar rates. All of our convergence rates are nearly dimension-free, which are independent of the size of the state and action spaces up to logarithm factors, highlighting the positive role of entropy regularization for accelerating convergence.
Power of knockoff: The impact of ranking algorithm, augmented design, and symmetric statistic
http://jmlr.org/papers/v25/21-1137.html
http://jmlr.org/papers/volume25/21-1137/21-1137.pdf
2024Zheng Tracy Ke, Jun S. Liu, Yucong Ma
The knockoff filter is a recent false discovery rate (FDR) control method for high-dimensional linear models. We point out that knockoff has three key components: ranking algorithm, augmented design, and symmetric statistic, and each component admits multiple choices. By considering various combinations of the three components, we obtain a collection of variants of knockoff. All these variants guarantee finite-sample FDR control, and our goal is to compare their power. We assume a Rare and Weak signal model on regression coefficients and compare the power of different variants of knockoff by deriving explicit formulas of false positive rate and false negative rate. Our results provide new insights on how to improve power when controlling FDR at a targeted level. We also compare the power of knockoff with its prototype - a method that uses the same ranking algorithm but has access to an ideal threshold. The comparison reveals the additional price one pays by finding a data-driven threshold to control FDR.
Lower Complexity Bounds of Finite-Sum Optimization Problems: The Results and Construction
http://jmlr.org/papers/v25/21-0264.html
http://jmlr.org/papers/volume25/21-0264/21-0264.pdf
2024Yuze Han, Guangzeng Xie, Zhihua Zhang
In this paper we study the lower complexity bounds for finite-sum optimization problems, where the objective is the average of $n$ individual component functions. We consider a so-called proximal incremental first-order oracle (PIFO) algorithm, which employs the individual component function's gradient and proximal information provided by PIFO to update the variable. To incorporate loopless methods, we also allow the PIFO algorithm to obtain the full gradient infrequently. We develop a novel approach to constructing the hard instances, which partitions the tridiagonal matrix of classical examples into $n$ groups. This construction is friendly to the analysis of PIFO algorithms. Based on this construction, we establish the lower complexity bounds for finite-sum minimax optimization problems when the objective is convex-concave or nonconvex-strongly-concave and the class of component functions is $L$-average smooth. Most of these bounds are nearly matched by existing upper bounds up to log factors. We also derive similar lower bounds for finite-sum minimization problems as previous work under both smoothness and average smoothness assumptions. Our lower bounds imply that proximal oracles for smooth functions are not much more powerful than gradient oracles.
On Truthing Issues in Supervised Classification
http://jmlr.org/papers/v25/19-301.html
http://jmlr.org/papers/volume25/19-301/19-301.pdf
2024Jonathan K. Su
Ideal supervised classification assumes known correct labels, but various truthing issues can arise in practice: noisy labels; multiple, conflicting labels for a sample; missing labels; and different labeler combinations for different samples. Previous work introduced a noisy-label model, which views the observed noisy labels as random variables conditioned on the unobserved correct labels. It has mainly focused on estimating the conditional distribution of the noisy labels and the class prior, as well as estimating the correct labels or training with noisy labels. In a complementary manner, given the conditional distribution and class prior, we apply estimation theory to classifier testing, training, and comparison of different combinations of labelers. First, for binary classification, we construct a testing model and derive approximate marginal posteriors for accuracy, precision, recall, probability of false alarm, and F-score, and joint posteriors for ROC and precision-recall analysis. We propose minimum mean-square error (MMSE) testing, which employs empirical Bayes algorithms to estimate the testing-model parameters and then computes optimal point estimates and credible regions for the metrics. We extend the approach to multi-class classification to obtain optimal estimates of accuracy and individual confusion-matrix elements. Second, we present a unified view of training that covers probabilistic (i.e., discriminative or generative) and non-probabilistic models. For the former, we adjust maximum-likelihood or maximum a posteriori training for truthing issues; for the latter, we propose MMSE training, which minimizes the MMSE estimate of the empirical risk. We also describe suboptimal training that is compatible with existing infrastructure. Third, we observe that mutual information lets one express any labeler combination as an equivalent single labeler, implying that multiple mediocre labelers can be as informative as, or more informative than, a single expert labeler. Experiments demonstrate the effectiveness of the methods and confirm the implication.