http://www.jmlr.org
JMLRJournal of Machine Learning Research
Bagging in overparameterized learning: Risk characterization and risk monotonization
http://jmlr.org/papers/v24/23-0887.html
http://jmlr.org/papers/volume24/23-0887/23-0887.pdf
2023Pratik Patil, Jin-Hong Du, Arun Kumar Kuchibhotla
Bagging is a commonly used ensemble technique in statistics and machine learning to improve the performance of prediction procedures. In this paper, we study the prediction risk of variants of bagged predictors under the proportional asymptotics regime, in which the ratio of the number of features to the number of observations converges to a constant. Specifically, we propose a general strategy to analyze the prediction risk under squared error loss of bagged predictors using classical results on simple random sampling. Specializing the strategy, we derive the exact asymptotic risk of the bagged ridge and ridgeless predictors with an arbitrary number of bags under a well-specified linear model with arbitrary feature covariance matrices and signal vectors. Furthermore, we prescribe a generic cross-validation procedure to select the optimal subsample size for bagging and discuss its utility to eliminate the non-monotonic behavior of the limiting risk in the sample size (i.e., double or multiple descents). In demonstrating the proposed procedure for bagged ridge and ridgeless predictors, we thoroughly investigate the oracle properties of the optimal subsample size and provide an in-depth comparison between different bagging variants.
Operator learning with PCA-Net: upper and lower complexity bounds
http://jmlr.org/papers/v24/23-0478.html
http://jmlr.org/papers/volume24/23-0478/23-0478.pdf
2023Samuel Lanthaler
PCA-Net is a recently proposed neural operator architecture which combines principal component analysis (PCA) with neural networks to approximate operators between infinite-dimensional function spaces. The present work develops approximation theory for this approach, improving and significantly extending previous work in this direction: First, a novel universal approximation result is derived, under minimal assumptions on the underlying operator and the data-generating distribution. Then, two potential obstacles to efficient operator learning with PCA-Net are identified, and made precise through lower complexity bounds; the first relates to the complexity of the output distribution, measured by a slow decay of the PCA eigenvalues. The other obstacle relates to the inherent complexity of the space of operators between infinite-dimensional input and output spaces, resulting in a rigorous and quantifiable statement of a “curse of parametric complexity”, an infinite-dimensional analogue of the well-known curse of dimensionality encountered in high-dimensional approximation problems. In addition to these lower bounds, upper complexity bounds are finally derived. A suitable smoothness criterion is shown to ensure an algebraic decay of the PCA eigenvalues. Furthermore, it is shown that PCA-Net can overcome the general curse for specific operators of interest, arising from the Darcy flow and the Navier-Stokes equations.
Mixed Regression via Approximate Message Passing
http://jmlr.org/papers/v24/23-0473.html
http://jmlr.org/papers/volume24/23-0473/23-0473.pdf
2023Nelvin Tan, Ramji Venkataramanan
We study the problem of regression in a generalized linear model (GLM) with multiple signals and latent variables. This model, which we call a matrix GLM, covers many widely studied problems in statistical learning, including mixed linear regression, max-affine regression, and mixture-of-experts. The goal in all these problems is to estimate the signals, and possibly some of the latent variables, from the observations. We propose a novel approximate message passing (AMP) algorithm for estimation in a matrix GLM and rigorously characterize its performance in the high-dimensional limit. This characterization is in terms of a state evolution recursion, which allows us to precisely compute performance measures such as the asymptotic mean-squared error. The state evolution characterization can be used to tailor the AMP algorithm to take advantage of any structural information known about the signals. Using state evolution, we derive an optimal choice of AMP `denoising' functions that minimizes the estimation error in each iteration. The theoretical results are validated by numerical simulations for mixed linear regression, max-affine regression, and mixture-of-experts. For max-affine regression, we propose an algorithm that combines AMP with expectation-maximization to estimate the intercepts of the model along with the signals. The numerical results show that AMP significantly outperforms other estimators for mixed linear regression and max-affine regression in most parameter regimes.
The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima
http://jmlr.org/papers/v24/23-043.html
http://jmlr.org/papers/volume24/23-043/23-043.pdf
2023Peter L. Bartlett, Philip M. Long, Olivier Bousquet
We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization method for deep networks that has exhibited performance improvements on image and language prediction problems. We show that when SAM is applied with a convex quadratic objective, for most random initializations it converges to a cycle that oscillates between either side of the minimum in the direction with the largest curvature, and we provide bounds on the rate of convergence. In the non-quadratic case, we show that such oscillations effectively perform gradient descent, with a smaller step-size, on the spectral norm of the Hessian. In such cases, SAM's update may be regarded as a third derivative---the derivative of the Hessian in the leading eigenvector direction---that encourages drift toward wider minima.
MARLlib: A Scalable and Efficient Multi-agent Reinforcement Learning Library
http://jmlr.org/papers/v24/23-0378.html
http://jmlr.org/papers/volume24/23-0378/23-0378.pdf
2023Siyi Hu, Yifan Zhong, Minquan Gao, Weixun Wang, Hao Dong, Xiaodan Liang, Zhihui Li, Xiaojun Chang, Yaodong Yang
A significant challenge facing researchers in the area of multi-agent reinforcement learning (MARL) pertains to the identification of a library that can offer fast and compatible development for multi-agent tasks and algorithm combinations, while obviating the need to consider compatibility issues. In this paper, we present MARLlib, a library designed to address the aforementioned challenge by leveraging three key mechanisms: 1) a standardized multi-agent environment wrapper, 2) an agent-level algorithm implementation, and 3) a flexible policy mapping strategy. By utilizing these mechanisms, MARLlib can effectively disentangle the intertwined nature of the multi-agent task and the learning process of the algorithm, with the ability to automatically alter the training strategy based on the current task's attributes. The MARLlib library's source code is publicly accessible on GitHub: https://github.com/Replicable-MARL/MARLlib.
Fast Expectation Propagation for Heteroscedastic, Lasso-Penalized, and Quantile Regression
http://jmlr.org/papers/v24/23-0104.html
http://jmlr.org/papers/volume24/23-0104/23-0104.pdf
2023Jackson Zhou, John T. Ormerod, Clara Grazian
Expectation propagation (EP) is an approximate Bayesian inference (ABI) method which has seen widespread use across machine learning and statistics, owing to its accuracy and speed. However, it is often difficult to apply EP to models with complex likelihoods, where the EP updates do not have a tractable form and need to be calculated using methods such as multivariate numerical quadrature. These methods increase run time and reduce the appeal of EP as a fast approximate method. In this paper, we demonstrate that EP can still be made fast for certain models in this category. We focus on various types of linear regression, for which fast Bayesian inference is becoming increasingly important in the transition to big data. Fast EP updates are achieved through analytic integral reductions in certain moment computations. EP is compared to other ABI methods across simulations and benchmark datasets, and is shown to offer a good balance between accuracy and speed.
Zeroth-Order Alternating Gradient Descent Ascent Algorithms for A Class of Nonconvex-Nonconcave Minimax Problems
http://jmlr.org/papers/v24/22-1518.html
http://jmlr.org/papers/volume24/22-1518/22-1518.pdf
2023Zi Xu, Zi-Qi Wang, Jun-Lin Wang, Yu-Hong Dai
In this paper, we consider a class of nonconvex-nonconcave minimax problems, i.e., NC-PL minimax problems, whose objective functions satisfy the Polyak-Lojasiewicz (PL) condition with respect to the inner variable. We propose a zeroth-order alternating gradient descent ascent (ZO-AGDA) algorithm and a zeroth-order variance reduced alternating gradient descent ascent (ZO-VRAGDA) algorithm for solving NC-PL minimax problem under the deterministic and the stochastic setting, respectively. The total number of function value queries to obtain an $\epsilon$-stationary point of ZO-AGDA and ZO-VRAGDA algorithm for solving NC-PL minimax problem is upper bounded by $\mathcal{O}(\varepsilon^{-2})$ and $\mathcal{O}(\varepsilon^{-3})$, respectively. To the best of our knowledge, they are the first two zeroth-order algorithms with the iteration complexity gurantee for solving NC-PL minimax problems.
The Measure and Mismeasure of Fairness
http://jmlr.org/papers/v24/22-1511.html
http://jmlr.org/papers/volume24/22-1511/22-1511.pdf
2023Sam Corbett-Davies, Johann D. Gaebler, Hamed Nilforoshan, Ravi Shroff, Sharad Goel
The field of fair machine learning aims to ensure that decisions guided by algorithms are equitable. Over the last decade, several formal, mathematical definitions of fairness have gained prominence. Here we first assemble and categorize these definitions into two broad families: (1) those that constrain the effects of decisions on disparities; and (2) those that constrain the effects of legally protected characteristics, like race and gender, on decisions. We then show, analytically and empirically, that both families of definitions typically result in strongly Pareto dominated decision policies. For example, in the case of college admissions, adhering to popular formal conceptions of fairness would simultaneously result in lower student-body diversity and a less academically prepared class, relative to what one could achieve by explicitly tailoring admissions policies to achieve desired outcomes. In this sense, requiring that these fairness definitions hold can, perversely, harm the very groups they were designed to protect. In contrast to axiomatic notions of fairness, we argue that the equitable design of algorithms requires grappling with their context-specific consequences, akin to the equitable design of policy. We conclude by listing several open challenges in fair machine learning and offering strategies to ensure algorithms are better aligned with policy goals.
Microcanonical Hamiltonian Monte Carlo
http://jmlr.org/papers/v24/22-1450.html
http://jmlr.org/papers/volume24/22-1450/22-1450.pdf
2023Jakob Robnik, G. Bruno De Luca, Eva Silverstein, Uroš Seljak
We develop Microcanonical Hamiltonian Monte Carlo (MCHMC), a class of models that follow fixed energy Hamiltonian dynamics, in contrast to Hamiltonian Monte Carlo (HMC), which follows canonical distribution with different energy levels. MCHMC tunes the Hamiltonian function such that the marginal of the uniform distribution on the constant-energy-surface over the momentum variables gives the desired target distribution. We show that MCHMC requires occasional energy-conserving billiard-like momentum bounces for ergodicity, analogous to momentum resampling in HMC. We generalize the concept of bounces to a continuous version with partial direction preserving bounces at every step, which gives energy-conserving underdamped Langevin-like dynamics with non-Gaussian noise (MCLMC). MCHMC and MCLMC exhibit favorable scalings with condition number and dimensionality. We develop an efficient hyperparameter tuning scheme that achieves high performance and consistently outperforms NUTS HMC on several standard benchmark problems, in some cases by orders of magnitude.
Prediction Equilibrium for Dynamic Network Flows
http://jmlr.org/papers/v24/22-1446.html
http://jmlr.org/papers/volume24/22-1446/22-1446.pdf
2023Lukas Graf, Tobias Harks, Kostas Kollias, Michael Markl
We study a dynamic traffic assignment model, where agents base their instantaneous routing decisions on real-time delay predictions. We formulate a mathematically concise model and define dynamic prediction equilibrium (DPE) in which no agent can at any point during their journey improve their predicted travel time by switching to a different route. We demonstrate the versatility of our framework by showing that it subsumes the well-known full information and instantaneous information models, in addition to admitting further realistic predictors as special cases. We then proceed to derive properties of the predictors that ensure a dynamic prediction equilibrium exists. Additionally, we define $\varepsilon$-approximate DPE wherein no agent can improve their predicted travel time by more than $\varepsilon$ and provide further conditions of the predictors under which such an approximate equilibrium can be computed. Finally, we complement our theoretical analysis by an experimental study, in which we systematically compare the induced average travel times of different predictors, including two machine-learning based models trained on data gained from previously computed approximate equilibrium flows, both on synthetic and real world road networks.
Dimension Reduction and MARS
http://jmlr.org/papers/v24/22-1422.html
http://jmlr.org/papers/volume24/22-1422/22-1422.pdf
2023Yu Liu LIU, Degui Li, Yingcun Xia
The multivariate adaptive regression spline (MARS) is one of the popular estimation methods for nonparametric multivariate regression. However, as MARS is based on marginal splines, to incorporate interactions of covariates, products of the marginal splines must be used, which often leads to an unmanageable number of basis functions when the order of interaction is high and results in low estimation efficiency. In this paper, we improve the performance of MARS by using linear combinations of the covariates which achieve sufficient dimension reduction. The special basis functions of MARS facilitate the calculation of gradients of the regression function, and estimation of these linear combinations is obtained via eigen-analysis of the outer-product of the gradients. Under some technical conditions, the consistency property is established for the proposed estimation method. Numerical studies including both simulation and empirical applications show its effectiveness in dimension reduction and improvement over MARS and other commonly-used nonparametric methods in regression estimation and prediction.
Nevis'22: A Stream of 100 Tasks Sampled from 30 Years of Computer Vision Research
http://jmlr.org/papers/v24/22-1345.html
http://jmlr.org/papers/volume24/22-1345/22-1345.pdf
2023Jorg Bornschein, Alexandre Galashov, Ross Hemsley, Amal Rannen-Triki, Yutian Chen, Arslan Chaudhry, Xu Owen He, Arthur Douillard, Massimo Caccia, Qixuan Feng, Jiajun Shen, Sylvestre-Alvise Rebuffi, Kitty Stacpoole, Diego de las Casas, Will Hawkins, Angeliki Lazaridou, Yee Whye Teh, Andrei A. Rusu, Razvan Pascanu, Marc’Aurelio Ranzato
A shared goal of several machine learning communities like continual learning, meta-learning and transfer learning, is to design algorithms and models that efficiently and robustly adapt to unseen tasks. An even more ambitious goal is to build models that never stop adapting, and that become increasingly more efficient through time by suitably transferring the accrued knowledge. Beyond the study of the actual learning algorithm and model architecture, there are several hurdles towards our quest to build such models, such as the choice of learning protocol, metric of success and data needed to validate research hypotheses. In this work, we introduce the Never-Ending VIsual-classification Stream (NEVIS'22), a benchmark consisting of a stream of over 100 visual classification tasks, sorted chronologically and extracted from papers sampled uniformly from computer vision proceedings spanning the last three decades. The resulting stream reflects what the research community thought was meaningful at any point in time, and it serves as an ideal test bed to assess how well models can adapt to new tasks, and do so better and more efficiently as time goes by. Despite being limited to classification, the resulting stream has a rich diversity of tasks from OCR, to texture analysis, scene recognition, and so forth. The diversity is also reflected in the wide range of dataset sizes, spanning over four orders of magnitude. Overall, NEVIS'22 poses an unprecedented challenge for current sequential learning approaches due to the scale and diversity of tasks, yet with a low entry barrier as it is limited to a single modality and well understood supervised learning problems. Moreover, we provide a reference implementation including strong baselines and an evaluation protocol to compare methods in terms of their trade-off between accuracy and compute. We hope that NEVIS'22 can be useful to researchers working on continual learning, meta-learning, AutoML and more generally sequential learning, and help these communities join forces towards more robust models that efficiently adapt to a never ending stream of data.
Fast Screening Rules for Optimal Design via Quadratic Lasso Reformulation
http://jmlr.org/papers/v24/22-1305.html
http://jmlr.org/papers/volume24/22-1305/22-1305.pdf
2023Guillaume Sagnol, Luc Pronzato
The problems of Lasso regression and optimal design of experiments share a critical property: their optimal solutions are typically sparse, i.e., only a small fraction of the optimal variables are non-zero. Therefore, the identification of the support of an optimal solution reduces the dimensionality of the problem and can yield a substantial simplification of the calculations. It has recently been shown that linear regression with a squared $\ell_1$-norm sparsity-inducing penalty is equivalent to an optimal experimental design problem. In this work, we use this equivalence to derive safe screening rules that can be used to discard inessential samples. Compared to previously existing rules, the new tests are much faster to compute, especially for problems involving a parameter space of high dimension, and can be used dynamically within any iterative solver, with negligible computational overhead. Moreover, we show how an existing homotopy algorithm to compute the regularization path of the lasso method can be reparametrized with respect to the squared $\ell_1$-penalty. This allows the computation of a Bayes $c$-optimal design in a finite number of steps and can be several orders of magnitude faster than standard first-order algorithms. The efficiency of the new screening rules and of the homotopy algorithm are demonstrated on different examples based on real data.
Multi-Consensus Decentralized Accelerated Gradient Descent
http://jmlr.org/papers/v24/22-1210.html
http://jmlr.org/papers/volume24/22-1210/22-1210.pdf
2023Haishan Ye, Luo Luo, Ziang Zhou, Tong Zhang
his paper considers the decentralized convex optimization problem, which has a wide range of applications in large-scale machine learning, sensor networks, and control theory. We propose novel algorithms that achieve optimal computation complexity and near optimal communication complexity. Our theoretical results give affirmative answers to the open problem on whether there exists an algorithm that can achieve a communication complexity (nearly) matching the lower bound depending on the global condition number instead of the local one. Furthermore, the linear convergence of our algorithms only depends on the strong convexity of global objective and it does not require the local functions to be convex. The design of our methods relies on a novel integration of well-known techniques including Nesterov's acceleration, multi-consensus and gradient-tracking. Empirical studies show the outperformance of our methods for machine learning applications.
Continuous-in-time Limit for Bayesian Bandits
http://jmlr.org/papers/v24/22-1181.html
http://jmlr.org/papers/volume24/22-1181/22-1181.pdf
2023Yuhua Zhu, Zachary Izzo, Lexing Ying
This paper revisits the bandit problem in the Bayesian setting. The Bayesian approach formulates the bandit problem as an optimization problem, and the goal is to find the optimal policy which minimizes the Bayesian regret. One of the main challenges facing the Bayesian approach is that computation of the optimal policy is often intractable, especially when the length of the problem horizon or the number of arms is large. In this paper, we first show that under a suitable rescaling, the Bayesian bandit problem converges toward a continuous Hamilton-Jacobi-Bellman (HJB) equation. The optimal policy for the limiting HJB equation can be explicitly obtained for several common bandit problems, and we give numerical methods to solve the HJB equation when an explicit solution is not available. Based on these results, we propose an approximate Bayes-optimal policy for solving Bayesian bandit problems with large horizons. Our method has the added benefit that its computational cost does not increase as the horizon increases.
Two Sample Testing in High Dimension via Maximum Mean Discrepancy
http://jmlr.org/papers/v24/22-1136.html
http://jmlr.org/papers/volume24/22-1136/22-1136.pdf
2023Hanjia Gao, Xiaofeng Shao
Maximum Mean Discrepancy (MMD) has been widely used in the areas of machine learning and statistics to quantify the distance between two distributions in the $p$-dimensional Euclidean space. The asymptotic property of the sample MMD has been well studied when the dimension $p$ is fixed using the theory of U-statistic. As motivated by the frequent use of MMD test for data of moderate/high dimension, we propose to investigate the behavior of the sample MMD in a high-dimensional environment and develop a new studentized test statistic. Specifically, we obtain the central limit theorems for the studentized sample MMD as both the dimension $p$ and sample sizes $n,m$ diverge to infinity. Our results hold for a wide range of kernels, including popular Gaussian and Laplacian kernels, and also cover energy distance as a special case. We also derive the explicit rate of convergence under mild assumptions and our results suggest that the accuracy of normal approximation can improve with dimensionality. Additionally, we provide a general theory on the power analysis under the alternative hypothesis and show that our proposed test can detect difference between two distributions in the moderately high dimensional regime. Numerical simulations demonstrate the effectiveness of our proposed test statistic and normal approximation.
Random Feature Amplification: Feature Learning and Generalization in Neural Networks
http://jmlr.org/papers/v24/22-1132.html
http://jmlr.org/papers/volume24/22-1132/22-1132.pdf
2023Spencer Frei, Niladri S. Chatterji, Peter L. Bartlett
In this work, we provide a characterization of the feature-learning process in two-layer ReLU networks trained by gradient descent on the logistic loss following random initialization. We consider data with binary labels that are generated by an XOR-like function of the input features. We permit a constant fraction of the training labels to be corrupted by an adversary. We show that, although linear classifiers are no better than random guessing for the distribution we consider, two-layer ReLU networks trained by gradient descent achieve generalization error close to the label noise rate. We develop a novel proof technique that shows that at initialization, the vast majority of neurons function as random features that are only weakly correlated with useful features, and the gradient descent dynamics `amplify’ these weak, random features to strong, useful features.
Pivotal Estimation of Linear Discriminant Analysis in High Dimensions
http://jmlr.org/papers/v24/22-1032.html
http://jmlr.org/papers/volume24/22-1032/22-1032.pdf
2023Ethan X. Fang, Yajun Mei, Yuyang Shi, Qunzhi Xu, Tuo Zhao
We consider the linear discriminant analysis problem in the high-dimensional settings. In this work, we propose PANDA(PivotAl liNear Discriminant Analysis), a tuning insensitive method in the sense that it requires very little effort to tune the parameters. Moreover, we prove that PANDA achieves the optimal convergence rate in terms of both the estimation error and misclassification rate. Our theoretical results are backed up by thorough numerical studies using both simulated and real datasets. In comparison with the existing methods, we observe that our proposed PANDA yields equal or better performance, and requires substantially less effort in parameter tuning.
Learning Optimal Feedback Operators and their Sparse Polynomial Approximations
http://jmlr.org/papers/v24/22-1010.html
http://jmlr.org/papers/volume24/22-1010/22-1010.pdf
2023Karl Kunisch, Donato Vásquez-Varas, Daniel Walter
A learning based method for obtaining feedback laws for nonlinear optimal control problems is proposed. The learning problem is posed such that the open loop value function is its optimal solution. This infinite dimensional, function space, problem, is approximated by a polynomial ansatz and its convergence is analyzed. An $\ell_1$ penalty term is employed, which combined with the proximal point method, allows to find sparse solutions for the learning problem. The approach requires multiple evaluations of the elements of the polynomial basis and of their derivatives. In order to do this efficiently a graph-theoretic algorithm is devised. Several examples underline that the proposed methodology provides a promising approach for mitigating the curse of dimensionality which would be involved in case the optimal feedback law was obtained by solving the Hamilton Jacobi Bellman equation.
Sensitivity-Free Gradient Descent Algorithms
http://jmlr.org/papers/v24/22-1002.html
http://jmlr.org/papers/volume24/22-1002/22-1002.pdf
2023Ion Matei, Maksym Zhenirovskyy, Johan de Kleer, John Maxwell
We introduce two block coordinate descent algorithms for solving optimization problems with ordinary differential equations (ODEs) as dynamical constraints. In contrast to prior algorithms, ours do not need to implement sensitivity analysis methods to evaluate loss function gradients. They result from the reformulation of the original problem as an equivalent optimization problem with equality constraints. In our first algorithm we avoid explicitly solving the ODE by integrating the ODE solver as a sequence of implicit constraints. In our second algorithm, we add an ODE solver to reset the estimate of the ODE solution, but no sensitivity analysis method is needed. We test the proposed algorithms on the problem of learning the parameters of the Cucker-Smale model. The algorithms are compared with gradient descent algorithms based on ODE solvers endowed with sensitivity analysis capabilities. We show that the proposed algorithms are at least 4x faster when implemented in Pytorch, and at least 16x faster when implemented in Jax. For large versions of the Cucker-Smale model, the Jax implementation is thousands of times faster. Our algorithms generate more accurate results both on training and test data. In addition, we show how the proposed algorithms scale with the number of optimization variables, and how they can be applied to learning black-box models of dynamical systems. Moreover, we demonstrate how our approach can be combined with approaches based on sensitivity analysis enabled ODE solvers to reduce the training time.
A PDE approach for regret bounds under partial monitoring
http://jmlr.org/papers/v24/22-1001.html
http://jmlr.org/papers/volume24/22-1001/22-1001.pdf
2023Erhan Bayraktar, Ibrahim Ekren, Xin Zhang
In this paper, we study a learning problem in which a forecaster only observes partial information. By properly rescaling the problem, we heuristically derive a limiting PDE on Wasserstein space which characterizes the asymptotic behavior of the regret of the forecaster. Using a verification type argument, we show that the problem of obtaining regret bounds and efficient algorithms can be tackled by finding appropriate smooth sub/supersolutions of this parabolic PDE.
A General Learning Framework for Open Ad Hoc Teamwork Using Graph-based Policy Learning
http://jmlr.org/papers/v24/22-099.html
http://jmlr.org/papers/volume24/22-099/22-099.pdf
2023Arrasy Rahman, Ignacio Carlucho, Niklas Höpner, Stefano V. Albrecht
Open ad hoc teamwork is the problem of training a single agent to efficiently collaborate with an unknown group of teammates whose composition may change over time. A variable team composition creates challenges for the agent, such as the requirement to adapt to new team dynamics and dealing with changing state vector sizes. These challenges are aggravated in real-world applications in which the controlled agent only has a partial view of the environment. In this work, we develop a class of solutions for open ad hoc teamwork under full and partial observability. We start by developing a solution for the fully observable case that leverages graph neural network architectures to obtain an optimal policy based on reinforcement learning. We then extend this solution to partially observable scenarios by proposing different methodologies that maintain belief estimates over the latent environment states and team composition. These belief estimates are combined with our solution for the fully observable case to compute an agent's optimal policy under partial observability in open ad hoc teamwork. Empirical results demonstrate that our solution can learn efficient policies in open ad hoc teamwork in fully and partially observable cases. Further analysis demonstrates that our methods' success is a result of effectively learning the effects of teammates' actions while also inferring the inherent state of the environment under partial observability.
Causal Bandits for Linear Structural Equation Models
http://jmlr.org/papers/v24/22-0969.html
http://jmlr.org/papers/volume24/22-0969/22-0969.pdf
2023Burak Varici, Karthikeyan Shanmugam, Prasanna Sattigeri, Ali Tajer
This paper studies the problem of designing an optimal sequence of interventions in a causal graphical model to minimize cumulative regret with respect to the best intervention in hindsight. This is, naturally, posed as a causal bandit problem. The focus is on causal bandits for linear structural equation models (SEMs) and soft interventions. It is assumed that the graph's structure is known and has $N$ nodes. Two linear mechanisms, one soft intervention and one observational, are assumed for each node, giving rise to $2^N$ possible interventions. The majority of the existing causal bandit algorithms assume that at least the interventional distributions of the reward node's parents are fully specified. However, there are $2^N$ such distributions (one corresponding to each intervention), acquiring which becomes prohibitive even in moderate-sized graphs. This paper dispenses with the assumption of knowing these distributions or their marginals. Two algorithms are proposed for the frequentist (UCB-based) and Bayesian (Thompson sampling-based) settings. The key idea of these algorithms is to avoid directly estimating the $2^N$ reward distributions and instead estimate the parameters that fully specify the SEMs (linear in $N$) and use them to compute the rewards. In both algorithms, under boundedness assumptions on noise and the parameter space, the cumulative regrets scale as $\tilde{\cal O} (d^{L+\frac{1}{2}} \sqrt{NT})$, where $d$ is the graph's maximum degree, and $L$ is the length of its longest causal path. Additionally, a minimax lower of $\Omega(d^{\frac{L}{2}-2}\sqrt{T})$ is presented, which suggests that the achievable and lower bounds conform in their scaling behavior with respect to the horizon $T$ and graph parameters $d$ and $L$.
High-Dimensional Inference for Generalized Linear Models with Hidden Confounding
http://jmlr.org/papers/v24/22-0834.html
http://jmlr.org/papers/volume24/22-0834/22-0834.pdf
2023Jing Ouyang, Kean Ming Tan, Gongjun Xu
Statistical inferences for high-dimensional regression models have been extensively studied for their wide applications ranging from genomics, neuroscience, to economics. However, in practice, there are often potential unmeasured confounders associated with both the response and covariates, which can lead to invalidity of standard debiasing methods. This paper focuses on a generalized linear regression framework with hidden confounding and proposes a debiasing approach to address this high-dimensional problem, by adjusting for the effects induced by the unmeasured confounders. We establish consistency and asymptotic normality for the proposed debiased estimator. The finite sample performance of the proposed method is demonstrated through extensive numerical studies and an application to a genetic data set.
Weibull Racing Survival Analysis with Competing Events, Left Truncation, and Time-Varying Covariates
http://jmlr.org/papers/v24/22-0833.html
http://jmlr.org/papers/volume24/22-0833/22-0833.pdf
2023Quan Zhang, Yanxun Xu, Mei-Cheng Wang, Mingyuan Zhou
We propose Bayesian nonparametric Weibull delegate racing (WDR) to fill the gap in interpretable nonlinear survival analysis with competing events, left truncation, and time-varying covariates. We set a two-phase race among a potentially infinite number of sub-events to model nonlinear covariate effects, which does not rely on transformations or complex functions of the covariates. Using gamma processes, the nonlinear capacity of WDR is parsimonious and data-adaptive. In prediction accuracy, WDR dominates cause-specific Cox and Fine-Gray models and is comparable to random survival forests in the presence of time-invariant covariates. More importantly, WDR can cope with different types of censoring, missing outcomes, left truncation, and time-varying covariates, on which other nonlinear models, such as the random survival forests, Gaussian processes, and deep learning approaches, are largely silent. We develop an efficient MCMC algorithm based on Gibbs sampling. We analyze biomedical data, interpret disease progression affected by covariates, and show the potential of WDR in discovering and diagnosing new diseases.
Erratum: Risk Bounds for the Majority Vote: From a PAC-Bayesian Analysis to a Learning Algorithm
http://jmlr.org/papers/v24/22-0747.html
http://jmlr.org/papers/volume24/22-0747/22-0747.pdf
2023Louis-Philippe Vignault, Audrey Durand, Pascal Germain
This work shows that the demonstration of Proposition 15 of Germain et al. (2015) is flawed and the proposition is false in a general setting. This proposition gave an inequality that upper-bounds the variance of the margin of a weighted majority vote classifier. Even though this flaw has little impact on the validity of the other results presented in Germain et al. (2015), correcting it leads to a deeper understanding of the $\mathcal{C}$-bound, which is a key inequality that upper-bounds the risk of a majority vote classifier by the moments of its margin, and to a new result, namely a lower-bound on the $\mathcal{C}$-bound. Notably, Germain et al.'s statement that “the $\mathcal{C}$-bound can be arbitrarily small” is invalid in presence of irreducible error in learning problems with label noise. In this erratum, we pinpoint the mistake present in the demonstration of the said proposition, we give a corrected version of the proposition, and we propose a new theoretical lower bound on the $\mathcal{C}$-bound.
Augmented Transfer Regression Learning with Semi-non-parametric Nuisance Models
http://jmlr.org/papers/v24/22-0700.html
http://jmlr.org/papers/volume24/22-0700/22-0700.pdf
2023Molei Liu, Yi Zhang, Katherine P. Liao, Tianxi Cai
We develop an augmented transfer regression learning (ATReL) approach that introduces an imputation model to augment the importance weighting equation to achieve double robustness for covariate shift correction. More significantly, we propose a novel semi-non-parametric (SNP) construction framework for the two nuisance models. Compared with existing doubly robust approaches relying on fully parametric or fully non-parametric (machine learning) nuisance models, our proposal is more flexible and balanced to address model misspecification and the curse of dimensionality, achieving a better trade-off in terms of model complexity. The SNP construction presents a new technical challenge in controlling the first-order bias caused by the nuisance estimators. To overcome this, we propose a two-step calibrated estimating approach to construct the nuisance models that ensures the effective reduction of potential bias. Under this SNP framework, our ATReL estimator is root-n-consistent when (i) at least one nuisance model is correctly specified and (ii) the nonparametric components are rate-doubly robust. Simulation studies demonstrate that our method is more robust and efficient than existing methods under various configurations. We also examine the utility of our method through a real transfer learning example of the phenotyping algorithm for rheumatoid arthritis across different time windows. Finally, we propose ways to enhance the intrinsic efficiency of our estimator and to incorporate modern machine-learning methods in the proposed SNP framework.
From Understanding Genetic Drift to a Smart-Restart Mechanism for Estimation-of-Distribution Algorithms
http://jmlr.org/papers/v24/22-0628.html
http://jmlr.org/papers/volume24/22-0628/22-0628.pdf
2023Weijie Zheng, Benjamin Doerr
Estimation-of-distribution algorithms (EDAs) are optimization algorithms that learn a distribution from which good solutions can be sampled easily. A key parameter of most EDAs is the sample size (population size). Too small values lead to the undesired effect of genetic drift, while larger values slow down the process. Building on a quantitative analysis of how the population size leads to genetic drift, we design a smart-restart mechanism for EDAs. By stopping runs when the risk for genetic drift is high, it automatically runs the EDA in good parameter regimes. Via a mathematical runtime analysis, we prove a general performance guarantee for this smart-restart scheme. For many situations where the optimal parameter values are known, this shows that the restart scheme automatically finds these optimal values, leading to the asymptotically optimal performance. We also conduct an extensive experimental analysis. On four classic benchmarks, the smart-restart scheme leads to a performance close to the one obtainable with optimal parameter values. We also conduct experiments with PBIL (cross-entropy algorithm) on the max-cut problem and the bipartition problem. Again, the smart-restart mechanism finds much better values for the population size than those suggested in the literature, leading to a much better performance.
A Unified Analysis of Multi-task Functional Linear Regression Models with Manifold Constraint and Composite Quadratic Penalty
http://jmlr.org/papers/v24/22-0587.html
http://jmlr.org/papers/volume24/22-0587/22-0587.pdf
2023Shiyuan He, Hanxuan Ye, Kejun He
This work studies the multi-task functional linear regression models where both the covariates and the unknown regression coefficients (called slope functions) are curves. For slope function estimation, we employ penalized splines to balance bias, variance, and computational complexity. The power of multi-task learning is brought in by imposing additional structures over the slope functions. We propose a general model with double regularization over the spline coefficient matrix: i) a matrix manifold constraint, and ii) a composite penalty as a summation of quadratic terms. Many multi-task learning approaches can be treated as special cases of this proposed model, such as a reduced-rank model and a graph Laplacian regularized model. We show the composite penalty induces a specific norm, which helps quantify the manifold curvature and determine the corresponding proper subset in the manifold tangent space. The complexity of tangent space subset is then bridged to the complexity of geodesic neighbor via generic chaining. A unified upper bound of the convergence rate is obtained and specifically applied to the reduced-rank model and the graph Laplacian regularized model. The phase transition behaviors for the estimators are examined as we vary the configurations of model parameters.
Deletion and Insertion Tests in Regression Models
http://jmlr.org/papers/v24/22-0560.html
http://jmlr.org/papers/volume24/22-0560/22-0560.pdf
2023Naofumi Hama, Masayoshi Mase, Art B. Owen
A basic task in explainable AI (XAI) is to identify the most important features behind a prediction made by a black box function f. The insertion and deletion tests of Petsiuk et al. (2018) can be used to judge the quality of algorithms that rank pixels from most to least important for a classification. Motivated by regression problems we establish a formula for their area under the curve (AUC) criteria in terms of certain main effects and interactions in an anchored decomposition of f. We find an expression for the expected value of the AUC under a random ordering of inputs to f and propose an alternative area above a straight line for the regression setting. We use this criterion to compare feature importances computed by integrated gradients (IG) to those computed by Kernel SHAP (KS) as well as LIME, DeepLIFT, vanilla gradient and input×gradient methods. KS has the best overall performance in two datasets we consider but it is very expensive to compute. We find that IG is nearly as good as KS while being much faster. Our comparison problems include some binary inputs that pose a challenge to IG because it must use values between the possible variable levels and so we consider ways to handle binary variables in IG. We show that sorting variables by their Shapley value does not necessarily give the optimal ordering for an insertion-deletion test. It will however do that for monotone functions of additive models, such as logistic regression.
Deep Neural Networks with Dependent Weights: Gaussian Process Mixture Limit, Heavy Tails, Sparsity and Compressibility
http://jmlr.org/papers/v24/22-0537.html
http://jmlr.org/papers/volume24/22-0537/22-0537.pdf
2023Hoil Lee, Fadhel Ayed, Paul Jung, Juho Lee, Hongseok Yang, Francois Caron
This article studies the infinite-width limit of deep feedforward neural networks whose weights are dependent, and modelled via a mixture of Gaussian distributions. Each hidden node of the network is assigned a nonnegative random variable that controls the variance of the outgoing weights of that node. We make minimal assumptions on these per-node random variables: they are iid and their sum, in each layer, converges to some finite random variable in the infinite-width limit. Under this model, we show that each layer of the infinite-width neural network can be characterised by two simple quantities: a non-negative scalar parameter and a L\'evy measure on the positive reals. If the scalar parameters are strictly positive and the L\'evy measures are trivial at all hidden layers, then one recovers the classical Gaussian process (GP) limit, obtained with iid Gaussian weights. More interestingly, if the L\'evy measure of at least one layer is non-trivial, we obtain a mixture of Gaussian processes (MoGP) in the large-width limit. The behaviour of the neural network in this regime is very different from the GP regime. One obtains correlated outputs, with non-Gaussian distributions, possibly with heavy tails. Additionally, we show that, in this regime, the weights are compressible, and some nodes have asymptotically non-negligible contributions, therefore representing important hidden features. Many sparsity-promoting neural network models can be recast as special cases of our approach, and we discuss their infinite-width limits; we also present an asymptotic analysis of the pruning error. We illustrate some of the benefits of the MoGP regime over the GP regime in terms of representation learning and compressibility on simulated, MNIST and Fashion MNIST datasets.
A New Look at Dynamic Regret for Non-Stationary Stochastic Bandits
http://jmlr.org/papers/v24/22-0387.html
http://jmlr.org/papers/volume24/22-0387/22-0387.pdf
2023Yasin Abbasi-Yadkori, András György, Nevena Lazić
We study the non-stationary stochastic multi-armed bandit problem, where the reward statistics of each arm may change several times during the course of learning. The performance of a learning algorithm is evaluated in terms of its dynamic regret, which is defined as the difference between the expected cumulative reward of an agent choosing the optimal arm in every time step and the cumulative reward of the learning algorithm. One way to measure the hardness of such environments is to consider how many times the identity of the optimal arm can change. We propose a method that achieves, in $K$-armed bandit problems, a near-optimal $\widetilde O(\sqrt{K N(S+1)})$ dynamic regret, where $N$ is the time horizon of the problem and $S$ is the number of times the identity of the optimal arm changes, without prior knowledge of $S$. Previous works for this problem obtain regret bounds that scale with the number of changes (or the amount of change) in the reward functions, which can be much larger, or assume prior knowledge of $S$ to achieve similar bounds.
Universal Approximation Property of Invertible Neural Networks
http://jmlr.org/papers/v24/22-0384.html
http://jmlr.org/papers/volume24/22-0384/22-0384.pdf
2023Isao Ishikawa, Takeshi Teshima, Koichi Tojo, Kenta Oono, Masahiro Ikeda, Masashi Sugiyama
Invertible neural networks (INNs) are neural network architectures with invertibility by design. Thanks to their invertibility and the tractability of their Jacobians, INNs have various machine learning applications such as probabilistic modeling, generative modeling, and representation learning. However, their attractive properties often come at the cost of restricting the layer design, which poses a question on their representation power: can we use these models to approximate sufficiently diverse functions? To answer this question, we have developed a general theoretical framework to investigate the representation power of INNs, building on a structure theorem of differential geometry. The framework simplifies the approximation problem of diffeomorphisms, which enables us to show the universal approximation properties of INNs. We apply the framework to two representative classes of INNs, namely Coupling-Flow-based INNs (CF-INNs) and Neural Ordinary Differential Equations (NODEs), and elucidate their high representation power despite the restrictions on their architectures.
Low Tree-Rank Bayesian Vector Autoregression Models
http://jmlr.org/papers/v24/22-0360.html
http://jmlr.org/papers/volume24/22-0360/22-0360.pdf
2023Leo L Duan, Zeyu Yuwen, George Michailidis, Zhengwu Zhang
Vector autoregression has been widely used for modeling and analysis of multivariate time series data. In high-dimensional settings, model parameter regularization schemes inducing sparsity yield interpretable models and achieved good forecasting performance. However, in many data applications, such as those in neuroscience, the Granger causality graph estimates from existing vector autoregression methods tend to be quite dense and difficult to interpret, unless one compromises on the goodness-of-fit. To address this issue, this paper proposes to incorporate a commonly used structural assumption --- that the ground-truth graph should be largely connected, in the sense that it should only contain at most a few components. We take a Bayesian approach and develop a novel tree-rank prior distribution for the regression coefficients. Specifically, this prior distribution forces the non-zero coefficients to appear only on the union of a few spanning trees. Since each spanning tree connects $p$ nodes with only $(p-1)$ edges, it effectively achieves both high connectivity and high sparsity. We develop a computationally efficient Gibbs sampler that is scalable to large sample size and high dimension. In analyzing test-retest functional magnetic resonance imaging data, our model produces a much more interpretable graph estimate, compared to popular existing approaches. In addition, we show appealing properties of this new method, such as efficient computation, mild stability conditions and posterior consistency.
Generic Unsupervised Optimization for a Latent Variable Model With Exponential Family Observables
http://jmlr.org/papers/v24/22-0359.html
http://jmlr.org/papers/volume24/22-0359/22-0359.pdf
2023Hamid Mousavi, Jakob Drefs, Florian Hirschberger, Jörg Lücke
Latent variable models (LVMs) represent observed variables by parameterized functions of latent variables. Prominent examples of LVMs for unsupervised learning are probabilistic PCA or probabilistic sparse coding which both assume a weighted linear summation of the latents to determine the mean of a Gaussian distribution for the observables. In many cases, however, observables do not follow a Gaussian distribution. For unsupervised learning, LVMs which assume specific non-Gaussian observables (e.g., Bernoulli or Poisson) have therefore been considered. Already for specific choices of distributions, parameter optimization is challenging and only a few previous contributions considered LVMs with more generally defined observable distributions. In this contribution, we do consider LVMs that are defined for a range of different distributions, i.e., observables can follow any (regular) distribution of the exponential family. Furthermore, the novel class of LVMs presented here is defined for binary latents, and it uses maximization in place of summation to link the latents to observables. In order to derive an optimization procedure, we follow an expectation maximization approach for maximum likelihood parameter estimation. We then show, as our main result, that a set of very concise parameter update equations can be derived which feature the same functional form for all exponential family distributions. The derived generic optimization can consequently be applied (without further derivations) to different types of metric data (Gaussian and non-Gaussian) as well as to different types of discrete data. Moreover, the derived optimization equations can be combined with a recently suggested variational acceleration which is likewise generically applicable to the LVMs considered here. Thus, the combination maintains generic and direct applicability of the derived optimization procedure, but, crucially, enables efficient scalability. We numerically verify our analytical results using different observable distributions, and, furthermore, discuss some potential applications such as learning of variance structure, noise type estimation and denoising.
A Complete Characterization of Linear Estimators for Offline Policy Evaluation
http://jmlr.org/papers/v24/22-0341.html
http://jmlr.org/papers/volume24/22-0341/22-0341.pdf
2023Juan C. Perdomo, Akshay Krishnamurthy, Peter Bartlett, Sham Kakade
Offline policy evaluation is a fundamental statistical problem in reinforcement learning that involves estimating the value function of some decision-making policy given data collected by a potentially different policy. In order to tackle problems with complex, high-dimensional observations, there has been significant interest from theoreticians and practitioners alike in understanding the possibility of function approximation in reinforcement learning. Despite significant study, a sharp characterization of when we might expect offline policy evaluation to be tractable, even in the simplest setting of linear function approximation, has so far remained elusive, with a surprising number of strong negative results recently appearing in the literature. In this work, we identify simple control-theoretic and linear-algebraic conditions that are necessary and sufficient for classical methods, in particular Fitted Q-iteration (FQI) and least squares temporal difference learning (LSTD), to succeed at offline policy evaluation. Using this characterization, we establish a precise hierarchy of regimes under which these estimators succeed. We prove that LSTD works under strictly weaker conditions than FQI. Furthermore, we establish that if a problem is not solvable via LSTD, then it cannot be solved by a broad class of linear estimators, even in the limit of infinite data. Taken together, our results provide a complete picture of the behavior of linear estimators for offline policy evaluation, unify previously disparate analyses of canonical algorithms, and provide significantly sharper notions of the underlying statistical complexity of offline policy evaluation.
Near-Optimal Weighted Matrix Completion
http://jmlr.org/papers/v24/22-0331.html
http://jmlr.org/papers/volume24/22-0331/22-0331.pdf
2023Oscar López
Recent work in the matrix completion literature has shown that prior knowledge of a matrix's row and column spaces can be successfully incorporated into reconstruction programs to substantially benefit matrix recovery. This paper proposes a novel methodology that exploits more general forms of known matrix structure in terms of subspaces. The work derives reconstruction error bounds that are informative in practice, providing insight to previous approaches in the literature while introducing novel programs with reduced sample complexities. The main result shows that a family of weighted nuclear norm minimization programs incorporating a $M_1 r$-dimensional subspace of $n\times n$ matrices (where $M_1\geq 1$ conveys structural properties of the subspace) allow accurate approximation of a rank $r$ matrix aligned with the subspace from a near-optimal number of observed entries (within a logarithmic factor of $M_1 r)$. The result is robust, where the error is proportional to measurement noise, applies to full rank matrices, and reflects degraded output when erroneous prior information is imposed. Numerical experiments are presented that validate the theoretical behavior derived for several example weighted programs.
Community models for networks observed through edge nominations
http://jmlr.org/papers/v24/22-0313.html
http://jmlr.org/papers/volume24/22-0313/22-0313.pdf
2023Tianxi Li, Elizaveta Levina, Ji Zhu
Communities are a common and widely studied structure in networks, typically assuming that the network is fully and correctly observed. In practice, network data are often collected by querying nodes about their connections. In some settings, all edges of a sampled node will be recorded, and in others, a node may be asked to name its connections. These sampling mechanisms introduce noise and bias, which can obscure the community structure and invalidate assumptions underlying standard community detection methods. We propose a general model for a class of network sampling mechanisms based on recording edges via querying nodes, designed to improve community detection for network data collected in this fashion. We model edge sampling probabilities as a function of both individual preferences and community parameters, and show community detection can be performed by spectral clustering under this general class of models. We also propose, as a special case of the general framework, a parametric model for directed networks we call the nomination stochastic block model, which allows for meaningful parameter interpretations and can be fitted by the method of moments. In this case, spectral clustering and the method of moments are computationally efficient and come with theoretical guarantees of consistency. We evaluate the proposed model in simulation studies on unweighted and weighted networks and under misspecified models. The method is applied to a faculty hiring dataset, discovering a meaningful hierarchy of communities among US business schools.
The Bayesian Learning Rule
http://jmlr.org/papers/v24/22-0291.html
http://jmlr.org/papers/volume24/22-0291/22-0291.pdf
2023Mohammad Emtiyaz Khan, Håvard Rue
We show that many machine-learning algorithms are specific instances of a single algorithm called the Bayesian learning rule. The rule, derived from Bayesian principles, yields a wide-range of algorithms from fields such as optimization, deep learning, and graphical models. This includes classical algorithms such as ridge regression, Newton's method, and Kalman filter, as well as modern deep-learning algorithms such as stochastic-gradient descent, RMSprop, and Dropout. The key idea in deriving such algorithms is to approximate the posterior using candidate distributions estimated by using natural gradients. Different candidate distributions result in different algorithms and further approximations to natural gradients give rise to variants of those algorithms. Our work not only unifies, generalizes, and improves existing algorithms, but also helps us design new ones.
Removing Data Heterogeneity Influence Enhances Network Topology Dependence of Decentralized SGD
http://jmlr.org/papers/v24/22-0283.html
http://jmlr.org/papers/volume24/22-0283/22-0283.pdf
2023Kun Yuan, Sulaiman A. Alghunaim, Xinmeng Huang
We consider decentralized stochastic optimization problems, where a network of $n$ nodes cooperates to find a minimizer of the globally-averaged cost. A widely studied decentralized algorithm for this problem is the decentralized SGD (D-SGD), in which each node averages only with its neighbors. D-SGD is efficient in single-iteration communication, but it is very sensitive to the network topology. For smooth objective functions, the transient stage (which measures the number of iterations the algorithm has to experience before achieving the linear speedup stage) of D-SGD is on the order of ${O}(n/(1-\beta)^2)$ and $O(n^3/(1-\beta)^4)$ for strongly and generally convex cost functions, respectively, where $1-\beta \in (0,1)$ is a topology-dependent quantity that approaches $0$ for a large and sparse network. Hence, D-SGD suffers from slow convergence for large and sparse networks. In this work, we revisit the convergence property of the D$^2$/Exact-Diffusion algorithm. By eliminating the influence of data heterogeneity between nodes, D$^2$/Exact-diffusion is shown to have an enhanced transient stage that is on the order of $\tilde{O}(n/(1-\beta))$ and $O(n^3/(1-\beta)^2)$ for strongly and generally convex cost functions (where $\tilde{O}(\cdot)$ hides all logarithm factors), respectively. Moreover, when D$^2$/Exact-Diffusion is implemented with both gradient accumulation and multi-round gossip communications, its transient stage can be further improved to $\tilde{O}(1/(1-\beta)^{\frac{1}{2}})$ and $\tilde{O}(n/(1-\beta))$ for strongly and generally convex cost functions, respectively. To our knowledge, these established results for D$^2$/Exact-Diffusion have the best, i.e., weakest) dependence on network topology compared to existing decentralized algorithms. Numerical simulations are conducted to validate our theories.
Sparse Markov Models for High-dimensional Inference
http://jmlr.org/papers/v24/22-0266.html
http://jmlr.org/papers/volume24/22-0266/22-0266.pdf
2023Guilherme Ost, Daniel Y. Takahashi
Finite-order Markov models are well-studied models for dependent finite alphabet data. Despite their generality, application in empirical work is rare when the order $d$ is large relative to the sample size $n$ (e.g., $d = \mathcal{O}(n)$). Practitioners rarely use higher-order Markov models because (1) the number of parameters grows exponentially with the order, (2) the sample size $n$ required to estimate each parameter grows exponentially with the order, and (3) the interpretation is often difficult. Here, we consider a subclass of Markov models called Mixture of Transition Distribution (MTD) models, proving that when the set of relevant lags is sparse (i.e., $\mathcal{O}(\log(n))$), we can consistently and efficiently recover the lags and estimate the transition probabilities of high-dimensional ($d = \mathcal{O}(n)$) MTD models. Moreover, the estimated model allows straightforward interpretation. The key innovation is a recursive procedure for a priori selection of the relevant lags of the model. We prove a new structural result for the MTD and an improved martingale concentration inequality to prove our results. Using simulations, we show that our method performs well compared to other relevant methods. We also illustrate the usefulness of our method on weather data where the proposed method correctly recovers the long-range dependence.
Distinguishing Cause and Effect in Bivariate Structural Causal Models: A Systematic Investigation
http://jmlr.org/papers/v24/22-0151.html
http://jmlr.org/papers/volume24/22-0151/22-0151.pdf
2023Christoph Käding,, Jakob Runge,
Distinguishing cause and effect from purely observational data is a fundamental problem in science. Even the atomic bivariate case, seemingly the simplest, is challenging and re- quires further assumptions to be identifiable at all. In recent years a variety of approaches to address this problem has been developed, each with its own assumptions, strengths, and weaknesses. In machine learning common benchmarks with real and synthetic data have been a main driver of innovation. Synthetic benchmarks can explicitly model data characteristics such as the underlying functional relations and distributions to assess how methods deal with these. However, a systematic assessment of the state-of-the-art of meth- ods is currently missing. We provide a detailed and systematic comparison of a range of methods on a novel collection of datasets that systematically models individual data challenges. Further, we evaluate more recent methods missing in previous benchmarks. The novel suite of datasets will be contributed to the causeme.net benchmark platform to provide a continuously updated and searchable causal discovery method intercomparison database. Our aim is to assist users in finding the most suitable methods for their problem setting and for method developers to improve current and develop new methods.
Elastic Gradient Descent, an Iterative Optimization Method Approximating the Solution Paths of the Elastic Net
http://jmlr.org/papers/v24/22-0119.html
http://jmlr.org/papers/volume24/22-0119/22-0119.pdf
2023Oskar Allerbo, Johan Jonasson, Rebecka Jörnsten
The elastic net combines lasso and ridge regression to fuse the sparsity property of lasso with the grouping property of ridge regression. The connections between ridge regression and gradient descent and between lasso and forward stagewise regression have previously been shown. Similar to how the elastic net generalizes lasso and ridge regression, we introduce elastic gradient descent, a generalization of gradient descent and forward stagewise regression. We theoretically analyze elastic gradient descent and compare it to the elastic net and forward stagewise regression. Parts of the analysis are based on elastic gradient flow, a piecewise analytical construction, obtained for elastic gradient descent with infinitesimal step size. We also compare elastic gradient descent to the elastic net on real and simulated data and show that it provides similar solution paths, but is several orders of magnitude faster. Compared to forward stagewise regression, elastic gradient descent selects a model that, although still sparse, provides considerably lower prediction and estimation errors.
On Biased Compression for Distributed Learning
http://jmlr.org/papers/v24/21-1548.html
http://jmlr.org/papers/volume24/21-1548/21-1548.pdf
2023Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, Mher Safaryan
In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact biased compressors often show superior performance in practice when compared to the much more studied and understood unbiased compressors, very little is known about them. In this work we study three classes of biased compression operators, two of which are new, and their performance when applied to (stochastic) gradient descent and distributed (stochastic) gradient descent. We show for the first time that biased compressors can lead to linear convergence rates both in the single node and distributed settings. We prove that distributed compressed SGD method, employed with error feedback mechanism, enjoys the ergodic rate $O\left( \delta L \exp[-\frac{\mu K}{\delta L}] + \frac{(C + \delta D)}{K\mu}\right)$, where $\delta\ge1$ is a compression parameter which grows when more compression is applied, $L$ and $\mu$ are the smoothness and strong convexity constants, $C$ captures stochastic gradient noise ($C=0$ if full gradients are computed on each node) and $D$ captures the variance of the gradients at the optimum ($D=0$ for over-parameterized models). Further, via a theoretical study of several synthetic and empirical distributions of communicated gradients, we shed light on why and by how much biased compressors outperform their unbiased variants. Finally, we propose several new biased compressors with promising theoretical guarantees and practical performance.
Adaptive Clustering Using Kernel Density Estimators
http://jmlr.org/papers/v24/21-1532.html
http://jmlr.org/papers/volume24/21-1532/21-1532.pdf
2023Ingo Steinwart, Bharath K. Sriperumbudur, Philipp Thomann
We derive and analyze a generic, recursive algorithm for estimating all splits in a finite cluster tree as well as the corresponding clusters. We further investigate statistical properties of this generic clustering algorithm when it receives level set estimates from a kernel density estimator. In particular, we derive finite sample guarantees, consistency, rates of convergence, and an adaptive data-driven strategy for choosing the kernel bandwidth. For these results we do not need continuity assumptions on the density such as Hölder continuity, but only require intuitive geometric assumptions of non-parametric nature. In addition, we compare our results to other guarantees found in the literature and also present some experiments comparing our algorithm to $k$-means and hierarchical clustering.
A Continuous-time Stochastic Gradient Descent Method for Continuous Data
http://jmlr.org/papers/v24/21-1463.html
http://jmlr.org/papers/volume24/21-1463/21-1463.pdf
2023Kexin Jin, Jonas Latz, Chenguang Liu, Carola-Bibiane Schönlieb
Optimization problems with continuous data appear in, e.g., robust machine learning, functional data analysis, and variational inference. Here, the target function is given as an integral over a family of (continuously) indexed target functions---integrated with respect to a probability measure. Such problems can often be solved by stochastic optimization methods: performing optimization steps with respect to the indexed target function with randomly switched indices. In this work, we study a continuous-time variant of the stochastic gradient descent algorithm for optimization problems with continuous data. This so-called stochastic gradient process consists in a gradient flow minimizing an indexed target function that is coupled with a continuous-time index process determining the index. Index processes are, e.g., reflected diffusions, pure jump processes, or other Lévy processes on compact spaces. Thus, we study multiple sampling patterns for the continuous data space and allow for data simulated or streamed at runtime of the algorithm. We analyze the approximation properties of the stochastic gradient process and study its longtime behavior and ergodicity under constant and decreasing learning rates. We end with illustrating the applicability of the stochastic gradient process in a polynomial regression problem with noisy functional data, as well as in a physics-informed neural network.
Online Non-stochastic Control with Partial Feedback
http://jmlr.org/papers/v24/21-1444.html
http://jmlr.org/papers/volume24/21-1444/21-1444.pdf
2023Yu-Hu Yan, Peng Zhao, Zhi-Hua Zhou
Online control with non-stochastic disturbances and adversarially chosen convex cost functions, referred to as online non-stochastic control, has recently attracted increasing attention. We study online non-stochastic control with partial feedback, where learners can only access partially observed states and partially informed (bandit) costs. The problem setting arises naturally in real-world decision-making applications and strictly generalizes exceptional cases studied disparately by previous works. We propose the first online algorithm for this problem, with an $\tilde{O}(T^{3/4})$ regret competing with the best policy in hindsight, where $T$ denotes the time horizon and the $\tilde{O}(\cdot)$-notation omits the poly-logarithmic factors in $T$. To further enhance the algorithms' robustness to changing environments, we then design a novel method with a two-layer structure to optimize the dynamic regret, a more challenging measure that competes with time-varying policies. Our method is based on the online ensemble framework by treating the controller above as the base learner. On top of that, we design two different meta-combiners to simultaneously handle the unknown variation of environments and the memory issue arising from the online control. We prove that the two resulting algorithms enjoy $\tilde{O}(T^{3/4}(1+P_T)^{1/2})$ and $\tilde{O}(T^{3/4}(1+P_T)^{1/4}+T^{5/6})$ dynamic regret respectively, where $P_T$ measures the environmental non-stationarity. Our results are further extended to unknown transition matrices. Finally, empirical studies in both synthetic linear and simulated nonlinear tasks validate our method's effectiveness, thus supporting the theoretical findings.
Distributed Sparse Regression via Penalization
http://jmlr.org/papers/v24/21-1333.html
http://jmlr.org/papers/volume24/21-1333/21-1333.pdf
2023Yao Ji, Gesualdo Scutari, Ying Sun, Harsha Honnappa
We study sparse linear regression over a network of agents, modeled as an undirected graph (with no centralized node). The estimation problem is formulated as the minimization of the sum of the local LASSO loss functions plus a quadratic penalty of the consensus constraint—the latter being instrumental to obtain distributed solution methods. While penalty-based consensus methods have been extensively studied in the optimization literature, their statistical and computational guarantees in the high dimensional setting remain unclear. This work provides an answer to this open problem. Our contribution is two-fold. First, we establish statistical consistency of the estimator: under a suitable choice of the penalty parameter, the optimal solution of the penalized problem achieves near optimal minimax rate $O(s \log d/N)$ in $\ell_2$-loss, where $s$ is the sparsity value, $d$ is the ambient dimension, and $N$ is the total sample size in the network—this matches centralized sample rates. Second, we show that the proximal-gradient algorithm applied to the penalized problem, which naturally leads to distributed implementations, converges linearly up to a tolerance of the order of the centralized statistical error---the rate scales as $O(d)$, revealing an unavoidable speed-accuracy dilemma. Numerical results demonstrate the tightness of the derived sample rate and convergence rate scalings.
Causal Discovery with Unobserved Confounding and Non-Gaussian Data
http://jmlr.org/papers/v24/21-1329.html
http://jmlr.org/papers/volume24/21-1329/21-1329.pdf
2023Y. Samuel Wang, Mathias Drton
We consider recovering causal structure from multivariate observational data. We assume the data arise from a linear structural equation model (SEM) in which the idiosyncratic errors are allowed to be dependent in order to capture possible latent confounding. Each SEM can be represented by a graph where vertices represent observed variables, directed edges represent direct causal effects, and bidirected edges represent dependence among error terms. Specifically, we assume that the true model corresponds to a bow-free acyclic path diagram; i.e., a graph that has at most one edge between any pair of nodes and is acyclic in the directed part. We show that when the errors are non-Gaussian, the exact causal structure encoded by such a graph, and not merely an equivalence class, can be recovered from observational data. The method we propose for this purpose uses estimates of suitable moments, but, in contrast to previous results, does not require specifying the number of latent variables a priori. We also characterize the output of our procedure when the assumptions are violated and the true graph is acyclic, but not bow-free. We illustrate the effectiveness of our procedure in simulations and an application to an ecology data set.
Sharper Analysis for Minibatch Stochastic Proximal Point Methods: Stability, Smoothness, and Deviation
http://jmlr.org/papers/v24/21-1219.html
http://jmlr.org/papers/volume24/21-1219/21-1219.pdf
2023Xiao-Tong Yuan, Ping Li
The stochastic proximal point (SPP) methods have gained recent attention for stochastic optimization, with strong convergence guarantees and superior robustness to the classic stochastic gradient descent (SGD) methods showcased at little to no cost of computational overhead added. In this article, we study a minibatch variant of SPP, namely M-SPP, for solving convex composite risk minimization problems. The core contribution is a set of novel excess risk bounds of M-SPP derived through the lens of algorithmic stability theory. Particularly under smoothness and quadratic growth conditions, we show that M-SPP with minibatch-size $n$ and iteration count $T$ enjoys an in-expectation fast rate of convergence consisting of an $\mathcal{O}\left(\frac{1}{T^2}\right)$ bias decaying term and an $\mathcal{O}\left(\frac{1}{nT}\right)$ variance decaying term.
In the small-$n$-large-$T$ setting, this result substantially improves the best known results of SPP-type approaches by revealing the impact of noise level of model on convergence rate. In the complementary small-$T$-large-$n$ regime, we propose a two-phase extension of M-SPP to achieve comparable convergence rates. Additionally, we establish a deviation bound on the parameter estimation error of a sampling-without-replacement variant of M-SPP, which holds with high probability over the randomness of data while in expectation over the randomness of algorithm. Numerical evidences are provided to support our theoretical predictions when substantialized to Lasso and logistic regression models.
Dynamic Ranking with the BTL Model: A Nearest Neighbor based Rank Centrality Method
http://jmlr.org/papers/v24/21-1179.html
http://jmlr.org/papers/volume24/21-1179/21-1179.pdf
2023Eglantine Karlé, Hemant Tyagi
Many applications such as recommendation systems or sports tournaments involve pairwise comparisons within a collection of $n$ items, the goal being to aggregate the binary outcomes of the comparisons in order to recover the latent strength and/or global ranking of the items. In recent years, this problem has received significant interest from a theoretical perspective with a number of methods being proposed, along with associated statistical guarantees under the assumption of a suitable generative model. While these results typically collect the pairwise comparisons as one comparison graph $G$, however in many applications -- such as the outcomes of soccer matches during a tournament -- the nature of pairwise outcomes can evolve with time. Theoretical results for such a dynamic setting are relatively limited compared to the aforementioned static setting. We study in this paper an extension of the classic BTL (Bradley-Terry-Luce) model for the static setting to our dynamic setup under the assumption that the probabilities of the pairwise outcomes evolve smoothly over the time domain $[0,1]$. Given a sequence of comparison graphs $(G_{t'})_{t' \in \mathcal{T}}$ on a regular grid $\mathcal{T} \subset [0,1]$, we aim at recovering the latent strengths of the items $w_t^* \in \mathbb{R}^n$ at any time $t \in [0,1]$. To this end, we adapt the Rank Centrality method -- a popular spectral approach for ranking in the static case -- by locally averaging the available data on a suitable neighborhood of $t$. When $(G_{t'})_{t' \in \mathcal{T}}$ is a sequence of Erdös-Renyi graphs, we provide non-asymptotic $\ell_2$ and $\ell_{\infty}$ error bounds for estimating $w_t^*$ which in particular establishes the consistency of this method in terms of $n$, and the grid size $|\mathcal{T}|$. We also complement our theoretical analysis with experiments on real and synthetic data.
Revisiting minimum description length complexity in overparameterized models
http://jmlr.org/papers/v24/21-1133.html
http://jmlr.org/papers/volume24/21-1133/21-1133.pdf
2023Raaz Dwivedi, Chandan Singh, Bin Yu, Martin Wainwright
Complexity is a fundamental concept underlying statistical learning theory that aims to inform generalization performance. Parameter count, while successful in low-dimensional settings, is not well-justified for overparameterized settings when the number of parameters is more than the number of training samples. We revisit complexity measures based on Rissanen's principle of minimum description length (MDL) and define a novel MDL-based complexity (MDL-COMP) that remains valid for overparameterized models. MDL-COMP is defined via an optimality criterion over the encodings induced by a good Ridge estimator class. We provide an extensive theoretical characterization of MDL-COMP for linear models and kernel methods and show that it is not just a function of parameter count, but rather a function of the singular values of the design or the kernel matrix and the signal-to-noise ratio. For a linear model with $n$ observations, $d$ parameters, and i.i.d. Gaussian predictors, MDL-COMP scales linearly with $d$ when $d<n$, but the scaling is exponentially smaller---$\log d$ for $d>n$. For kernel methods, we show that MDL-COMP informs minimax in-sample error, and can decrease as the dimensionality of the input increases. We also prove that MDL-COMP upper bounds the in-sample mean squared error (MSE). Via an array of simulations and real-data experiments, we show that a data-driven Prac-MDL-COMP informs hyper-parameter tuning for optimizing test MSE with ridge regression in limited data settings, sometimes improving upon cross-validation and (always) saving computational costs. Finally, our findings also suggest that the recently observed double decent phenomenons in overparameterized models might be a consequence of the choice of non-ideal estimators.
Sparse Plus Low Rank Matrix Decomposition: A Discrete Optimization Approach
http://jmlr.org/papers/v24/21-1130.html
http://jmlr.org/papers/volume24/21-1130/21-1130.pdf
2023Dimitris Bertsimas, Ryan Cory-Wright, Nicholas A. G. Johnson
We study the Sparse Plus Low-Rank decomposition problem (SLR), which is the problem of decomposing a corrupted data matrix into a sparse matrix of perturbations plus a low-rank matrix containing the ground truth. SLR is a fundamental problem in Operations Research and Machine Learning which arises in various applications, including data compression, latent semantic indexing, collaborative filtering, and medical imaging. We introduce a novel formulation for SLR that directly models its underlying discreteness. For this formulation, we develop an alternating minimization heuristic that computes high-quality solutions and a novel semidefinite relaxation that provides meaningful bounds for the solutions returned by our heuristic. We also develop a custom branch-and-bound algorithm that leverages our heuristic and convex relaxations to solve small instances of SLR to certifiable (near) optimality. Given an input n-by-n matrix, our heuristic scales to solve instances where n = 10000 in minutes, our relaxation scales to instances where n = 200 in hours, and our branch-and-bound algorithm scales to instances where n = 25 in minutes. Our numerical results demonstrate that our approach outperforms existing state-of-the-art approaches in terms of rank, sparsity, and mean-square error while maintaining a comparable runtime.
On the Estimation of Derivatives Using Plug-in Kernel Ridge Regression Estimators
http://jmlr.org/papers/v24/21-1110.html
http://jmlr.org/papers/volume24/21-1110/21-1110.pdf
2023Zejian Liu, Meng Li
We study the problem of estimating the derivatives of a regression function, which has a wide range of applications as a key nonparametric functional of unknown functions. Standard analysis may be tailored to specific derivative orders, and parameter tuning remains a daunting challenge particularly for high-order derivatives. In this article, we propose a simple plug-in kernel ridge regression (KRR) estimator in nonparametric regression with random design that is broadly applicable for multi-dimensional support and arbitrary mixed-partial derivatives. We provide a non-asymptotic analysis to study the behavior of the proposed estimator in a unified manner that encompasses the regression function and its derivatives, leading to two error bounds for a general class of kernels under the strong $L_\infty$ norm. In a concrete example specialized to kernels with polynomially decaying eigenvalues, the proposed estimator recovers the minimax optimal rate up to a logarithmic factor for estimating derivatives of functions in Hölder and Sobolev classes. Interestingly, the proposed estimator achieves the optimal rate of convergence with the same choice of tuning parameter for any order of derivatives. Hence, the proposed estimator enjoys a plug-in property for derivatives in that it automatically adapts to the order of derivatives to be estimated, enabling easy tuning in practice. Our simulation studies show favorable finite sample performance of the proposed method relative to several existing methods and corroborate the theoretical findings on its minimax optimality.
Surrogate Assisted Semi-supervised Inference for High Dimensional Risk Prediction
http://jmlr.org/papers/v24/21-1075.html
http://jmlr.org/papers/volume24/21-1075/21-1075.pdf
2023Jue Hou, Zijian Guo, Tianxi Cai
Risk modeling with electronic health records (EHR) data is challenging due to no direct observations of the disease outcome and the high-dimensional predictors. In this paper, we develop a surrogate assisted semi-supervised learning approach, leveraging small labeled data with annotated outcomes and extensive unlabeled data of outcome surrogates and high-dimensional predictors. We propose to impute the unobserved outcomes by constructing a sparse imputation model with outcome surrogates and high-dimensional predictors. We further conduct a one-step bias correction to enable interval estimation for the risk prediction. Our inference procedure is valid even if both the imputation and risk prediction models are misspecified. Our novel way of ultilizing unlabelled data enables the high-dimensional statistical inference for the challenging setting with a dense risk prediction model. We present an extensive simulation study to demonstrate the superiority of our approach compared to existing supervised methods. We apply the method to genetic risk prediction of type-2 diabetes mellitus using an EHR biobank cohort.
ProtoryNet - Interpretable Text Classification Via Prototype Trajectories
http://jmlr.org/papers/v24/21-0899.html
http://jmlr.org/papers/volume24/21-0899/21-0899.pdf
2023Dat Hong, Tong Wang, Stephen Baek
We propose a novel interpretable deep neural network for text classification, called ProtoryNet, based on a new concept of prototype trajectories. Motivated by the prototype theory in modern linguistics, ProtoryNet makes a prediction by finding the most similar prototype for each sentence in a text sequence and feeding an RNN backbone with the proximity of each sentence to the corresponding active prototype. The RNN backbone then captures the temporal pattern of the prototypes, which we refer to as prototype trajectories. Prototype trajectories enable intuitive and fine-grained interpretation of the reasoning process of the RNN model, in resemblance to how humans analyze texts. We also design a prototype pruning procedure to reduce the total number of prototypes used by the model for better interpretability. Experiments on multiple public datasets demonstrate that ProtoryNet achieves higher accuracy than the baseline prototype-based deep neural net and narrows the performance gap when compared to state-of-the-art black-box models. In addition, after prototype pruning, the resulting ProtoryNet models only need less than or around 20 prototypes for all datasets, which significantly benefits interpretability. Furthermore, we report survey results indicating that human users find ProtoryNet more intuitive and easier to understand compared to other prototype-based methods.
Distributed Algorithms for U-statistics-based Empirical Risk Minimization
http://jmlr.org/papers/v24/21-0890.html
http://jmlr.org/papers/volume24/21-0890/21-0890.pdf
2023Lanjue Chen, Alan T.K. Wan, Shuyi Zhang, Yong Zhou
Empirical risk minimization, where the underlying loss function depends on a pair of data points, covers a wide range of application areas in statistics including pairwise ranking and survival analysis. The common empirical risk estimator obtained by averaging values of a loss function over all possible pairs of observations is essentially a U-statistic. One well-known problem with minimizing U-statistic type empirical risks, is that the computational complexity of U-statistics increases quadratically with the sample size. When faced with big data, this poses computational challenges as the colossal number of observation pairs virtually prohibits centralized computing to be performed on a single machine. This paper addresses this problem by developing two computationally and statistically efficient methods based on the divide-and-conquer strategy on a decentralized computing system, whereby the data are distributed among machines to perform the tasks. One of these methods is based on a surrogate of the empirical risk, while the other method extends the one-step updating scheme in classical M-estimation to the case of pairwise loss. We show that the proposed estimators are as asymptotically efficient as the benchmark global U-estimator obtained under centralized computing. As well, we introduce two distributed iterative algorithms to facilitate the implementation of the proposed methods, and conduct extensive numerical experiments to demonstrate their merit.
Minimax Estimation for Personalized Federated Learning: An Alternative between FedAvg and Local Training?
http://jmlr.org/papers/v24/21-0224.html
http://jmlr.org/papers/volume24/21-0224/21-0224.pdf
2023Shuxiao Chen, Qinqing Zheng, Qi Long, Weijie J. Su
A widely recognized difficulty in federated learning arises from the statistical heterogeneity among clients: local datasets often originate from distinct yet not entirely unrelated probability distributions, and personalization is, therefore, necessary to achieve optimal results from each individual’s perspective. In this paper, we show how the excess risks of personalized federated learning using a smooth, strongly convex loss depend on data heterogeneity from a minimax point of view, with a focus on the FedAvg algorithm (McMahan et al., 2017) and pure local training (i.e., clients solve empirical risk minimization problems on their local datasets without any communication). Our main result reveals an approximate alternative between these two baseline algorithms for federated learning: the former algorithm is minimax rate optimal over a collection of instances when data heterogeneity is small, whereas the latter is minimax rate optimal when data heterogeneity is large, and the threshold is sharp up to a constant. As an implication, our results show that from a worst-case point of view, a dichotomous strategy that makes a choice between the two baseline algorithms is rate-optimal. Another implication is that the popular FedAvg following by local fine tuning strategy is also minimax optimal under additional regularity conditions. Our analysis relies on a new notion of algorithmic stability that takes into account the nature of federated learning.
Nearest Neighbor Dirichlet Mixtures
http://jmlr.org/papers/v24/21-0116.html
http://jmlr.org/papers/volume24/21-0116/21-0116.pdf
2023Shounak Chattopadhyay, Antik Chakraborty, David B. Dunson
There is a rich literature on Bayesian methods for density estimation, which characterize the unknown density as a mixture of kernels. Such methods have advantages in terms of providing uncertainty quantification in estimation, while being adaptive to a rich variety of densities. However, relative to frequentist locally adaptive kernel methods, Bayesian approaches can be slow and unstable to implement in relying on Markov chain Monte Carlo algorithms. To maintain most of the strengths of Bayesian approaches without the computational disadvantages, we propose a class of nearest neighbor-Dirichlet mixtures. The approach starts by grouping the data into neighborhoods based on standard algorithms. Within each neighborhood, the density is characterized via a Bayesian parametric model, such as a Gaussian with unknown parameters. Assigning a Dirichlet prior to the weights on these local kernels, we obtain a pseudo-posterior for the weights and kernel parameters. A simple and embarrassingly parallel Monte Carlo algorithm is proposed to sample from the resulting pseudo-posterior for the unknown density. Desirable asymptotic properties are shown, and the methods are evaluated in simulation studies and applied to a motivating data set in the context of classification.
Learning to Rank under Multinomial Logit Choice
http://jmlr.org/papers/v24/20-981.html
http://jmlr.org/papers/volume24/20-981/20-981.pdf
2023James A. Grant, David S. Leslie
Learning the optimal ordering of content is an important challenge in website design. The learning to rank (LTR) framework models this problem as a sequential problem of selecting lists of content and observing where users decide to click. Most previous work on LTR assumes that the user considers each item in the list in isolation, and makes binary choices to click or not on each. We introduce a multinomial logit (MNL) choice model to the LTR framework, which captures the behaviour of users who consider the ordered list of items as a whole and make a single choice among all the items and a no-click option. Under the MNL model, the user favours items which are either inherently more attractive, or placed in a preferable position within the list. We propose upper confidence bound (UCB) algorithms to minimise regret in two settings - where the position dependent parameters are known, and unknown. We present theoretical analysis leading to an $\Omega(\sqrt{JT})$ lower bound for the problem, an $\tilde{O}(\sqrt{JT})$ upper bound on regret of the UCB algorithm in the known-parameter setting, and an $\tilde{O}(K^2\sqrt{JT})$ upper bound on regret, the first, in the more challenging unknown-position-parameter setting. Our analyses are based on tight new concentration results for Geometric random variables, and novel functional inequalities for maximum likelihood estimators computed on discrete data
Scalable high-dimensional Bayesian varying coefficient models with unknown within-subject covariance
http://jmlr.org/papers/v24/20-1437.html
http://jmlr.org/papers/volume24/20-1437/20-1437.pdf
2023Ray Bai, Mary R. Boland, Yong Chen
Nonparametric varying coefficient (NVC) models are useful for modeling time-varying effects on responses that are measured repeatedly for the same subjects. When the number of covariates is moderate or large, it is desirable to perform variable selection from the varying coefficient functions. However, existing methods for variable selection in NVC models either fail to account for within-subject correlations or require the practitioner to specify a parametric form for the correlation structure. In this paper, we introduce the nonparametric varying coefficient spike-and-slab lasso (NVC-SSL) for Bayesian high dimensional NVC models. Through the introduction of functional random effects, our method allows for flexible modeling of within-subject correlations without needing to specify a parametric covariance function. We further propose several scalable optimization and Markov chain Monte Carlo (MCMC) algorithms. For variable selection, we propose an Expectation Conditional Maximization (ECM) algorithm to rapidly obtain maximum a posteriori (MAP) estimates. Our ECM algorithm scales linearly in the total number of observations $N$ and the number of covariates $p$. For uncertainty quantification, we introduce an approximate MCMC algorithm that also scales linearly in both $N$ and $p$. We demonstrate the scalability, variable selection performance, and inferential capabilities of our method through simulations and a real data application. These algorithms are implemented in the publicly available R package NVCSSL on the Comprehensive R Archive Network.
Multi-view Collaborative Gaussian Process Dynamical Systems
http://jmlr.org/papers/v24/19-094.html
http://jmlr.org/papers/volume24/19-094/19-094.pdf
2023Shiliang Sun, Jingjing Fei, Jing Zhao, Liang Mao
Gaussian process dynamical systems (GPDSs) have shown their effectiveness in many tasks of machine learning. However, when they address multi-view data, current GPDSs do not explicitly model the dependence between private and shared latent variables. Instead, they introduce structurally and intrinsically discrete segmentation in the latent space. In this paper, we propose the multi-view collaborative Gaussian process dynamical systems (McGPDSs) model, which assumes that the private latent variable for each view is controlled by its dynamical prior and the shared latent variable. The relevance between private and shared latent variables can be automatically learned by optimization in the Bayesian framework. The model is capable of learning an effective latent representation and generating novel data of one view given data of the other view. We evaluate our model on two-view data sets, and our model obtains better performance compared with the state-of-the-art multi-view GPDSs.
Fairlearn: Assessing and Improving Fairness of AI Systems
http://jmlr.org/papers/v24/23-0389.html
http://jmlr.org/papers/volume24/23-0389/23-0389.pdf
2023Hilde Weerts, Miroslav Dudík, Richard Edgar, Adrin Jalali, Roman Lutz, Michael Madaio
Fairlearn is an open source project to help practitioners assess and improve fairness of artificial intelligence (AI) systems. The associated Python library, also named fairlearn, supports evaluation of a model's output across affected populations and includes several algorithms for mitigating fairness issues. Grounded in the understanding that fairness is a sociotechnical challenge, the project integrates learning resources that aid practitioners in considering a system's broader societal context.
Scalable Real-Time Recurrent Learning Using Columnar-Constructive Networks
http://jmlr.org/papers/v24/23-0367.html
http://jmlr.org/papers/volume24/23-0367/23-0367.pdf
2023Khurram Javed, Haseeb Shah, Richard S. Sutton, Martha White
Constructing states from sequences of observations is an important component of reinforcement learning agents. One solution for state construction is to use recurrent neural networks. Back-propagation through time (BPTT), and real-time recurrent learning (RTRL) are two popular gradient-based methods for recurrent learning. BPTT requires complete trajectories of observations before it can compute the gradients and is unsuitable for online updates. RTRL can do online updates but scales poorly to large networks. In this paper, we propose two constraints that make RTRL scalable. We show that by either decomposing the network into independent modules or learning the network in stages, we can make RTRL scale linearly with the number of parameters. Unlike prior scalable gradient estimation algorithms, such as UORO and Truncated-BPTT, our algorithms do not add noise or bias to the gradient estimate. Instead, they trade off the functional capacity of the network for computationally efficient learning. We demonstrate the effectiveness of our approach over Truncated-BPTT on a prediction benchmark inspired by animal learning and by doing policy evaluation of pre-trained policies for Atari 2600 games.
Torchhd: An Open Source Python Library to Support Research on Hyperdimensional Computing and Vector Symbolic Architectures
http://jmlr.org/papers/v24/23-0300.html
http://jmlr.org/papers/volume24/23-0300/23-0300.pdf
2023Mike Heddes, Igor Nunes, Pere Vergés, Denis Kleyko, Danny Abraham, Tony Givargis, Alexandru Nicolau, Alexander Veidenbaum
Hyperdimensional computing (HD), also known as vector symbolic architectures (VSA), is a framework for computing with distributed representations by exploiting properties of random high-dimensional vector spaces. The commitment of the scientific community to aggregate and disseminate research in this particularly multidisciplinary area has been fundamental for its advancement. Joining these efforts, we present Torchhd, a high-performance open source Python library for HD/VSA. Torchhd seeks to make HD/VSA more accessible and serves as an efficient foundation for further research and application development. The easy-to-use library builds on top of PyTorch and features state-of-the-art HD/VSA functionality, clear documentation, and implementation examples from well-known publications. Comparing publicly available code with their corresponding Torchhd implementation shows that experiments can run up to 100x faster. Torchhd is available at: https://github.com/hyperdimensional-computing/torchhd.
skrl: Modular and Flexible Library for Reinforcement Learning
http://jmlr.org/papers/v24/23-0112.html
http://jmlr.org/papers/volume24/23-0112/23-0112.pdf
2023Antonio Serrano-Muñoz, Dimitrios Chrysostomou, Simon Bøgh, Nestor Arana-Arexolaleiba
skrl is an open-source modular library for reinforcement learning written in Python and designed with a focus on readability, simplicity, and transparency of algorithm implementations. In addition to supporting environments that use the traditional interfaces from OpenAI Gym/Farama Gymnasium, DeepMind and others, it provides the facility to load, configure, and operate NVIDIA Isaac Gym, Isaac Orbit, and Omniverse Isaac Gym environments. Furthermore, it enables the simultaneous training of several agents with customizable scopes (subsets of environments among all available ones), which may or may not share resources, in the same run. The library's documentation can be found at https://skrl.readthedocs.io and its source code is available on GitHub at https://github.com/Toni-SM/skrl.
Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model
http://jmlr.org/papers/v24/23-0069.html
http://jmlr.org/papers/volume24/23-0069/23-0069.pdf
2023Alexandra Sasha Luccioni, Sylvain Viguier, Anne-Laure Ligozat
Progress in machine learning (ML) comes with a cost to the environment, given that training ML models requires computational resources, energy and materials. In the present article, we aim to quantify the carbon footprint of BLOOM, a 176-billion parameter language model, across its life cycle. We estimate that BLOOM’s final training emitted approximately 24.7 tonnes of CO2eq if we consider only the dynamic power consumption, and 50.5 tonnes if we account for all processes ranging from equipment manufacturing to energy-based operational consumption. We also carry out an empirical study to measure the energy requirements and carbon emissions of its deployment for inference via an API endpoint receiving user queries in real-time. We conclude with a discussion regarding the difficulty of precisely estimating the carbon footprint of ML models and future research directions that can contribute towards improving carbon emissions reporting.
Adaptive False Discovery Rate Control with Privacy Guarantee
http://jmlr.org/papers/v24/23-0039.html
http://jmlr.org/papers/volume24/23-0039/23-0039.pdf
2023Xintao Xia, Zhanrui Cai
Differentially private multiple testing procedures can protect the information of individuals used in hypothesis tests while guaranteeing a small fraction of false discoveries. In this paper, we propose a differentially private adaptive FDR control method that can control the classic FDR metric exactly at a user-specified level $\alpha$ with a privacy guarantee, which is a non-trivial improvement compared to the differentially private Benjamini-Hochberg method proposed in Dwork et al. (2021). Our analysis is based on two key insights: 1) a novel $p$-value transformation that preserves both privacy and the mirror conservative property, and 2) a mirror peeling algorithm that allows the construction of the filtration and application of the optimal stopping technique. Numerical studies demonstrate that the proposed DP-AdaPT performs better compared to the existing differentially private FDR control methods. Compared to the non-private AdaPT, it incurs a small accuracy loss but significantly reduces the computation cost.
Atlas: Few-shot Learning with Retrieval Augmented Language Models
http://jmlr.org/papers/v24/23-0037.html
http://jmlr.org/papers/volume24/23-0037/23-0037.pdf
2023Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, Edouard Grave
Large language models have shown impressive few-shot results on a wide range of tasks. However, when knowledge is key for such results, as is the case for tasks such as question answering and fact checking, massive parameter counts to store knowledge seem to be needed. Retrieval-augmented models are known to excel at knowledge intensive tasks without the need for as many parameters, but it is unclear whether they work in few-shot settings. In this work we present Atlas, a carefully designed and pre-trained retrieval-augmented language model able to learn knowledge intensive tasks with very few training examples. We perform evaluations on a wide range of tasks, including MMLU, KILT and Natural Questions, and study the impact of the content of the document index, showing that it can easily be updated. Notably, Atlas reaches over 42% accuracy on Natural Questions using only 64 examples, outperforming a 540B parameter model by 3% despite having 50x fewer parameters.
Convex Reinforcement Learning in Finite Trials
http://jmlr.org/papers/v24/22-1514.html
http://jmlr.org/papers/volume24/22-1514/22-1514.pdf
2023Mirco Mutti, Riccardo De Santi, Piersilvio De Bartolomeis, Marcello Restelli
Convex Reinforcement Learning (RL) is a recently introduced framework that generalizes the standard RL objective to any convex (or concave) function of the state distribution induced by the agent's policy. This framework subsumes several applications of practical interest, such as pure exploration, imitation learning, and risk-averse RL, among others. However, the previous convex RL literature implicitly evaluates the agent's performance over infinite realizations (or trials), while most of the applications require excellent performance over a handful, or even just one, trials. To meet this practical demand, we formulate convex RL in finite trials, where the objective is any convex function of the empirical state distribution computed over a finite number of realizations. In this paper, we provide a comprehensive theoretical study of the setting, which includes an analysis of the importance of non-Markovian policies to achieve optimality, as well as a characterization of the computational and statistical complexity of the problem in various configurations.
Unbiased Multilevel Monte Carlo Methods for Intractable Distributions: MLMC Meets MCMC
http://jmlr.org/papers/v24/22-1468.html
http://jmlr.org/papers/volume24/22-1468/22-1468.pdf
2023Tianze Wang, Guanyang Wang
Constructing unbiased estimators from Markov chain Monte Carlo (MCMC) outputs is a difficult problem that has recently received a lot of attention in the statistics and machine learning communities. However, the current unbiased MCMC framework only works when the quantity of interest is an expectation, which excludes many practical applications. In this paper, we propose a general method for constructing unbiased estimators for functions of expectations and extend it to construct unbiased estimators for nested expectations. Our approach combines and generalizes the unbiased MCMC and Multilevel Monte Carlo (MLMC) methods. In contrast to traditional sequential methods, our estimator can be implemented on parallel processors. We show that our estimator has a finite variance and computational complexity and can achieve $\varepsilon$-accuracy within the optimal $O(1/\varepsilon^2)$ computational cost under mild conditions. Numerical experiments confirm our theoretical findings and demonstrate the benefits of unbiased estimators in the massively parallel regime.
Improving multiple-try Metropolis with local balancing
http://jmlr.org/papers/v24/22-1351.html
http://jmlr.org/papers/volume24/22-1351/22-1351.pdf
2023Philippe Gagnon, Florian Maire, Giacomo Zanella
Multiple-try Metropolis (MTM) is a popular Markov chain Monte Carlo method with the appealing feature of being amenable to parallel computing. At each iteration, it samples several candidates for the next state of the Markov chain and randomly selects one of them based on a weight function. The canonical weight function is proportional to the target density. We show both theoretically and empirically that this weight function induces pathological behaviours in high dimensions, especially during the convergence phase. We propose to instead use weight functions akin to the locally-balanced proposal distributions of Zanella (2020), thus yielding MTM algorithms that do not exhibit those pathological behaviours. To theoretically analyse these algorithms, we study the high-dimensional performance of ideal schemes that can be thought of as MTM algorithms which sample an infinite number of candidates at each iteration, as well as the discrepancy between such schemes and the MTM algorithms which sample a finite number of candidates. Our analysis unveils a strong distinction between the convergence and stationary phases: in the former, local balancing is crucial and effective to achieve fast convergence, while in the latter, the canonical and novel weight functions yield similar performance. Numerical experiments include an application in precision medicine involving a computationally-expensive forward model, which makes the use of parallel computing within MTM iterations beneficial.
Importance Sparsification for Sinkhorn Algorithm
http://jmlr.org/papers/v24/22-1311.html
http://jmlr.org/papers/volume24/22-1311/22-1311.pdf
2023Mengyu Li, Jun Yu, Tao Li, Cheng Meng
Sinkhorn algorithm has been used pervasively to approximate the solution to optimal transport (OT) and unbalanced optimal transport (UOT) problems. However, its practical application is limited due to the high computational complexity. To alleviate the computational burden, we propose a novel importance sparsification method, called Spar-Sink, to efficiently approximate entropy-regularized OT and UOT solutions. Specifically, our method employs natural upper bounds for unknown optimal transport plans to establish effective sampling probabilities, and constructs a sparse kernel matrix to accelerate Sinkhorn iterations, reducing the computational cost of each iteration from $O(n^2)$ to $\widetilde{O}(n)$ for a sample of size $n$. Theoretically, we show the proposed estimators for the regularized OT and UOT problems are consistent under mild regularity conditions. Experiments on various synthetic data demonstrate Spar-Sink outperforms mainstream competitors in terms of both estimation error and speed. A real-world echocardiogram data analysis shows Spar-Sink can effectively estimate and visualize cardiac cycles, from which one can identify heart failure and arrhythmia. To evaluate the numerical accuracy of cardiac cycle prediction, we consider the task of predicting the end-systole time point using the end-diastole one. Results show Spar-Sink performs as well as the classical Sinkhorn algorithm, requiring significantly less computational time.
Graph Attention Retrospective
http://jmlr.org/papers/v24/22-125.html
http://jmlr.org/papers/volume24/22-125/22-125.pdf
2023Kimon Fountoulakis, Amit Levi, Shenghao Yang, Aseem Baranwal, Aukosh Jagannath
Graph-based learning is a rapidly growing sub-field of machine learning with applications in social networks, citation networks, and bioinformatics. One of the most popular models is graph attention networks. They were introduced to allow a node to aggregate information from features of neighbor nodes in a non-uniform way, in contrast to simple graph convolution which does not distinguish the neighbors of a node. In this paper, we theoretically study the behaviour of graph attention networks. We prove multiple results on the performance of the graph attention mechanism for the problem of node classification for a contextual stochastic block model. Here, the node features are obtained from a mixture of Gaussians and the edges from a stochastic block model. We show that in an "easy" regime, where the distance between the means of the Gaussians is large enough, graph attention is able to distinguish inter-class from intra-class edges. Thus it maintains the weights of important edges and significantly reduces the weights of unimportant edges. Consequently, we show that this implies perfect node classification. In the "hard" regime, we show that every attention mechanism fails to distinguish intra-class from inter-class edges. In addition, we show that graph attention convolution cannot (almost) perfectly classify the nodes even if intra-class edges could be separated from inter-class edges. Beyond perfect node classification, we provide a positive result on graph attention's robustness against structural noise in the graph. In particular, our robustness result implies that graph attention can be strictly better than both the simple graph convolution and the best linear classifier of node features. We evaluate our theoretical results on synthetic and real-world data.
Confidence Intervals and Hypothesis Testing for High-dimensional Quantile Regression: Convolution Smoothing and Debiasing
http://jmlr.org/papers/v24/22-1217.html
http://jmlr.org/papers/volume24/22-1217/22-1217.pdf
2023Yibo Yan, Xiaozhou Wang, Riquan Zhang
$\ell_1$-penalized quantile regression ($\ell_1$-QR) is a useful tool for modeling the relationship between input and output variables when detecting heterogeneous effects in the high-dimensional setting. Hypothesis tests can then be formulated based on the debiased $\ell_1$-QR estimator that reduces the bias induced by Lasso penalty. However, the non-smoothness of the quantile loss brings great challenges to the computation, especially when the data dimension is high. Recently, the convolution-type smoothed quantile regression (SQR) model has been proposed to overcome such shortcoming, and people developed theory of estimation and variable selection therein. In this work, we combine the debiased method with SQR model and come up with the debiased $\ell_1$-SQR estimator, based on which we then establish confidence intervals and hypothesis testing in the high-dimensional setup. Theoretically, we provide the non-asymptotic Bahadur representation for our proposed estimator and also the Berry-Esseen bound, which implies the empirical coverage rates for the studentized confidence intervals. Furthermore, we build up the theory of hypothesis testing on both a single variable and a group of variables. Finally, we exhibit extensive numerical experiments on both simulated and real data to demonstrate the good performance of our method.
Selection by Prediction with Conformal p-values
http://jmlr.org/papers/v24/22-1176.html
http://jmlr.org/papers/volume24/22-1176/22-1176.pdf
2023Ying Jin, Emmanuel J. Candes
Decision making or scientific discovery pipelines such as job hiring and drug discovery often involve multiple stages: before any resource-intensive step, there is often an initial screening that uses predictions from a machine learning model to shortlist a few candidates from a large pool. We study screening procedures that aim to select candidates whose unobserved outcomes exceed user-specified values. We develop a method that wraps around any prediction model to produce a subset of candidates while controlling the proportion of falsely selected units. Building upon the conformal inference framework, our method first constructs p-values that quantify the statistical evidence for large outcomes; it then determines the shortlist by comparing the p-values to a threshold introduced in the multiple testing literature. In many cases, the procedure selects candidates whose predictions are above a data-dependent threshold. Our theoretical guarantee holds under mild exchangeability conditions on the samples, generalizing existing results on multiple conformal p-values. We demonstrate the empirical performance of our method via simulations, and apply it to job hiring and drug discovery datasets.
Alpha-divergence Variational Inference Meets Importance Weighted Auto-Encoders: Methodology and Asymptotics
http://jmlr.org/papers/v24/22-1160.html
http://jmlr.org/papers/volume24/22-1160/22-1160.pdf
2023Kamélia Daudel, Joe Benton, Yuyang Shi, Arnaud Doucet
Several algorithms involving the Variational Rényi (VR) bound have been proposed to minimize an alpha-divergence between a target posterior distribution and a variational distribution. Despite promising empirical results, those algorithms resort to biased stochastic gradient descent procedures and thus lack theoretical guarantees. In this paper, we formalize and study the VR-IWAE bound, a generalization of the importance weighted auto-encoder (IWAE) bound. We show that the VR-IWAE bound enjoys several desirable properties and notably leads to the same stochastic gradient descent procedure as the VR bound in the reparameterized case, but this time by relying on unbiased gradient estimators. We then provide two complementary theoretical analyses of the VR-IWAE bound and thus of the standard IWAE bound. Those analyses shed light on the benefits or lack thereof of these bounds. Lastly, we illustrate our theoretical claims over toy and real-data examples.
Sparse Graph Learning from Spatiotemporal Time Series
http://jmlr.org/papers/v24/22-1154.html
http://jmlr.org/papers/volume24/22-1154/22-1154.pdf
2023Andrea Cini, Daniele Zambon, Cesare Alippi
Outstanding achievements of graph neural networks for spatiotemporal time series analysis show that relational constraints introduce an effective inductive bias into neural forecasting architectures. Often, however, the relational information characterizing the underlying data-generating process is unavailable and the practitioner is left with the problem of inferring from data which relational graph to use in the subsequent processing stages. We propose novel, principled - yet practical - probabilistic score-based methods that learn the relational dependencies as distributions over graphs while maximizing end-to-end the performance at task. The proposed graph learning framework is based on consolidated variance reduction techniques for Monte Carlo score-based gradient estimation, is theoretically grounded, and, as we show, effective in practice. In this paper, we focus on the time series forecasting problem and show that, by tailoring the gradient estimators to the graph learning problem, we are able to achieve state-of-the-art performance while controlling the sparsity of the learned graph and the computational scalability. We empirically assess the effectiveness of the proposed method on synthetic and real-world benchmarks, showing that the proposed solution can be used as a stand-alone graph identification procedure as well as a graph learning component of an end-to-end forecasting architecture.
Improved Powered Stochastic Optimization Algorithms for Large-Scale Machine Learning
http://jmlr.org/papers/v24/22-1149.html
http://jmlr.org/papers/volume24/22-1149/22-1149.pdf
2023Zhuang Yang
Stochastic optimization, especially stochastic gradient descent (SGD), is now the workhorse for the vast majority of problems in machine learning. Various strategies, e.g., control variates, adaptive learning rate, momentum technique, etc., have been developed to improve canonical SGD that is of a low convergence rate and the poor generalization in practice. Most of these strategies improve SGD that can be attributed to control the updating direction (e.g., gradient descent or gradient ascent direction), or manipulate the learning rate. Along these two lines, this work first develops and analyzes a novel type of improved powered stochastic gradient descent algorithms from the perspectives of variance reduction, where the updating direction was determined by the Powerball function. Additionally, to bridge the gap between powered stochastic optimization (PSO) and the learning rate, which is now still an open problem for PSO, we propose an adaptive mechanism of updating the learning rate that resorts the Barzilai-Borwein (BB) like scheme, not only for the proposed algorithm, but also for classical PSO algorithms. The theoretical properties of the resulting algorithms for non-convex optimization problems are technically analyzed. Empirical tests using various benchmark data sets indicate the efficiency and robustness of our proposed algorithms.
PaLM: Scaling Language Modeling with Pathways
http://jmlr.org/papers/v24/22-1144.html
http://jmlr.org/papers/volume24/22-1144/22-1144.pdf
2023Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, Noah Fiedel
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
Leaky Hockey Stick Loss: The First Negatively Divergent Margin-based Loss Function for Classification
http://jmlr.org/papers/v24/22-1104.html
http://jmlr.org/papers/volume24/22-1104/22-1104.pdf
2023Oh-Ran Kwon, Hui Zou
Many modern classification algorithms are formulated through the regularized empirical risk minimization (ERM) framework, where the risk is defined based on a loss function. We point out that although the loss function in decision theory is non-negative by definition, the non-negativity of the loss function in ERM is not necessary to be classification-calibrated and to produce a Bayes consistent classifier. We introduce the leaky hockey stick loss (LHS loss), the first negatively divergent margin-based loss function. We prove that the LHS loss is classification-calibrated. When the hinge loss is replaced with the LHS loss in the ERM approach for deriving the kernel support vector machine (SVM), the corresponding optimization problem has a well-defined solution named the kernel leaky hockey stick classifier (LHS classifier). Under mild regularity conditions, we prove that the kernel LHS classifier is Bayes risk consistent. In our theoretical analysis, we overcome multiple challenges caused by the negative divergence of the LHS loss that does not exist in the analysis of the usual kernel machines. For a numerical demonstration, we provide a computationally efficient algorithm to solve the kernel LHS classifier and compare it to the kernel SVM on simulated data and fifteen benchmark data sets. To conclude this work, we further present a class of negatively divergent margin-based loss functions that have similar theoretical properties to those of the LHS loss. Interestingly, the LHS loss can be viewed as a limiting case of this family of negatively divergent margin-based loss functions.
Efficient Computation of Rankings from Pairwise Comparisons
http://jmlr.org/papers/v24/22-1086.html
http://jmlr.org/papers/volume24/22-1086/22-1086.pdf
2023M. E. J. Newman
We study the ranking of individuals, teams, or objects, based on pairwise comparisons between them, using the Bradley-Terry model. Estimates of rankings within this model are commonly made using a simple iterative algorithm first introduced by Zermelo almost a century ago. Here we describe an alternative and similarly simple iteration that provably returns identical results but does so much faster—over a hundred times faster in some cases. We demonstrate this algorithm with applications to a range of example data sets and derive a number of results regarding its convergence.
Scalable Computation of Causal Bounds
http://jmlr.org/papers/v24/22-1081.html
http://jmlr.org/papers/volume24/22-1081/22-1081.pdf
2023Madhumitha Shridharan, Garud Iyengar
We consider the problem of computing bounds for causal queries on causal graphs with unobserved confounders and discrete valued observed variables, where identifiability does not hold. Existing non-parametric approaches for computing such bounds use linear programming (LP) formulations that quickly become intractable for existing solvers because the size of the LP grows exponentially in the number of edges in the causal graph. We show that this LP can be significantly pruned, allowing us to compute bounds for significantly larger causal inference problems compared to existing techniques. This pruning procedure allows us to compute bounds in closed form for a special class of problems, including a well-studied family of problems where multiple confounded treatments influence an outcome. We extend our pruning methodology to fractional LPs which compute bounds for causal queries which incorporate additional observations about the unit. We show that our methods provide significant runtime improvement compared to benchmarks in experiments and extend our results to the finite data setting. For causal inference without additional observations, we propose an efficient greedy heuristic that produces high quality bounds, and scales to problems that are several orders of magnitude larger than those for which the pruned LP can be solved.
Neural Q-learning for solving PDEs
http://jmlr.org/papers/v24/22-1075.html
http://jmlr.org/papers/volume24/22-1075/22-1075.pdf
2023Samuel N. Cohen, Deqing Jiang, Justin Sirignano
Solving high-dimensional partial differential equations (PDEs) is a major challenge in scientific computing. We develop a new numerical method for solving elliptic-type PDEs by adapting the Q-learning algorithm in reinforcement learning. To solve PDEs with Dirichlet boundary condition, our “Q-PDE" algorithm is mesh-free and therefore has the potential to overcome the curse of dimensionality. Using a neural tangent kernel (NTK) approach, we prove that the neural network approximator for the PDE solution, trained with the Q-PDE algorithm, converges to the trajectory of an infinite-dimensional ordinary differential equation (ODE) as the number of hidden units $\rightarrow \infty$. For monotone PDEs (i.e., those given by monotone operators, which may be nonlinear), despite the lack of a spectral gap in the NTK, we then prove that the limit neural network, which satisfies the infinite-dimensional ODE, strongly converges in $L^2$ to the PDE solution as the training time $\rightarrow \infty$. More generally, we can prove that any fixed point of the wide-network limit for the Q-PDE algorithm is a solution of the PDE (not necessarily under the monotone condition). The numerical performance of the Q-PDE algorithm is studied for several elliptic PDEs.
Tractable and Near-Optimal Adversarial Algorithms for Robust Estimation in Contaminated Gaussian Models
http://jmlr.org/papers/v24/22-1053.html
http://jmlr.org/papers/volume24/22-1053/22-1053.pdf
2023Ziyue Wang, Zhiqiang Tan
Consider the problem of simultaneous estimation of location and variance matrix under Huber's contaminated Gaussian model. First, we study minimum $f$-divergence estimation at the population level, corresponding to a generative adversarial method with a nonparametric discriminator and establish conditions on $f$-divergences which lead to robust estimation, similarly to robustness of minimum distance estimation. More importantly, we develop tractable adversarial algorithms with simple spline discriminators, which can be defined by nested optimization such that the discriminator parameters are determined by maximizing a concave objective function given the current generator. The proposed methods are shown to achieve minimax optimal rates or near-optimal rates depending on the $f$-divergence and the penalty used. This is the first time such near-optimal error rates are established for adversarial algorithms with linear discriminators under Huber's contamination model. We present simulation studies to demonstrate advantages of the proposed methods over classic robust estimators, pairwise methods, and a generative adversarial method with neural network discriminators.
MultiZoo and MultiBench: A Standardized Toolkit for Multimodal Deep Learning
http://jmlr.org/papers/v24/22-1021.html
http://jmlr.org/papers/volume24/22-1021/22-1021.pdf
2023Paul Pu Liang, Yiwei Lyu, Xiang Fan, Arav Agarwal, Yun Cheng, Louis-Philippe Morency, Ruslan Salakhutdinov
Learning multimodal representations involves integrating information from multiple heterogeneous sources of data. In order to accelerate progress towards understudied modalities and tasks while ensuring real-world robustness, we release MultiZoo, a public toolkit consisting of standardized implementations of >20 core multimodal algorithms and MultiBench, a large-scale benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas. Together, these provide an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation. To enable holistic evaluation, we offer a comprehensive methodology to assess (1) generalization, (2) time and space complexity, and (3) modality robustness. MultiBench paves the way towards a better understanding of the capabilities and limitations of multimodal models, while ensuring ease of use, accessibility, and reproducibility. Our toolkits are publicly available, will be regularly updated, and welcome inputs from the community.
Strategic Knowledge Transfer
http://jmlr.org/papers/v24/22-0968.html
http://jmlr.org/papers/volume24/22-0968/22-0968.pdf
2023Max Olan Smith, Thomas Anthony, Michael P. Wellman
In the course of playing or solving a game, it is common to face a series of changing other-agent strategies. These strategies often share elements: the set of possible policies to play has overlap, and the policies are sampled at the beginning of play by possibly differing distributions. As it faces the series of strategies, therefore, an agent has the opportunity to transfer its learned play against the previously encountered other-agent policies. We tackle two problems: (1) how can learned responses transfer across changing opponent strategies, and (2) how can this transfer be used to reduced the cumulative cost of learning in game solving. The first problem we characterize as the strategic knowledge transfer problem. For value-based response policies, we demonstrate that Q-Mixing approximately solves this problem by appropriately averaging the component Q-values. Solutions to the first problem can be applied to reduce the computational cost of learning-based game solving algorithms. We offer two algorithms that operate within the Policy-Space Response Oracles (PSRO) framework. Mixed-Oracles reduces the per-policy construction cost by transferring responses from previously encountered opponents. Mixed-Opponents performs strategic knowledge transfer by combining the previously encountered opponents into a single novel policy. Experimental evaluation of these methods on general-sum grid-world games provide evidence about their advantages and limitations in comparison to standard PSRO.
Lifted Bregman Training of Neural Networks
http://jmlr.org/papers/v24/22-0934.html
http://jmlr.org/papers/volume24/22-0934/22-0934.pdf
2023Xiaoyu Wang, Martin Benning
We introduce a novel mathematical formulation for the training of feed-forward neural networks with (potentially non-smooth) proximal maps as activation functions. This formulation is based on Bregman distances and a key advantage is that its partial derivatives with respect to the network’s parameters do not require the computation of derivatives of the network's activation functions. Instead of estimating the parameters with a combination of first-order optimisation method and back-propagation (as is the state-of-the-art), we propose the use of non-smooth first-order optimisation methods that exploit the specific structure of the novel formulation. We present several numerical results that demonstrate that these training approaches can be equally well or even better suited for the training of neural network-based classifiers and (denoising) autoencoders with sparse coding compared to more conventional training frameworks.
Statistical Comparisons of Classifiers by Generalized Stochastic Dominance
http://jmlr.org/papers/v24/22-0902.html
http://jmlr.org/papers/volume24/22-0902/22-0902.pdf
2023Christoph Jansen, Malte Nalenz, Georg Schollmeyer, Thomas Augustin
Although being a crucial question for the development of machine learning algorithms, there is still no consensus on how to compare classifiers over multiple data sets with respect to several criteria. Every comparison framework is confronted with (at least) three fundamental challenges: the multiplicity of quality criteria, the multiplicity of data sets and the randomness of the selection of data sets. In this paper, we add a fresh view to the vivid debate by adopting recent developments in decision theory. Based on so-called preference systems, our framework ranks classifiers by a generalized concept of stochastic dominance, which powerfully circumvents the cumbersome, and often even self-contradictory, reliance on aggregates. Moreover, we show that generalized stochastic dominance can be operationalized by solving easy-to-handle linear programs and moreover statistically tested employing an adapted two-sample observation-randomization test. This yields indeed a powerful framework for the statistical comparison of classifiers over multiple data sets with respect to multiple quality criteria simultaneously. We illustrate and investigate our framework in a simulation study and with a set of standard benchmark data sets.
Sample Complexity for Distributionally Robust Learning under chi-square divergence
http://jmlr.org/papers/v24/22-0881.html
http://jmlr.org/papers/volume24/22-0881/22-0881.pdf
2023Zhengyu Zhou, Weiwei Liu
This paper investigates the sample complexity of learning a distributionally robust predictor under a particular distributional shift based on $\chi^2$-divergence, which is well known for its computational feasibility and statistical properties. We demonstrate that any hypothesis class $\mathcal{H}$ with finite VC dimension is distributionally robustly learnable. Moreover, we show that when the perturbation size is smaller than a constant, finite VC dimension is also necessary for distributionally robust learning by deriving a lower bound of sample complexity in terms of VC dimension.
Interpretable and Fair Boolean Rule Sets via Column Generation
http://jmlr.org/papers/v24/22-0880.html
http://jmlr.org/papers/volume24/22-0880/22-0880.pdf
2023Connor Lawless, Sanjeeb Dash, Oktay Gunluk, Dennis Wei
This paper considers the learning of Boolean rules in disjunctive normal form (DNF, OR-of-ANDs, equivalent to decision rule sets) as an interpretable model for classification. An integer program is formulated to optimally trade classification accuracy for rule simplicity. We also consider the fairness setting and extend the formulation to include explicit constraints on two different measures of classification parity: equality of opportunity and equalized odds. Column generation (CG) is used to efficiently search over an exponential number of candidate rules without the need for heuristic rule mining. To handle large data sets, we propose an approximate CG algorithm using randomization. Compared to three recently proposed alternatives, the CG algorithm dominates the accuracy-simplicity trade-off in 8 out of 16 data sets. When maximized for accuracy, CG is competitive with rule learners designed for this purpose, sometimes finding significantly simpler solutions that are no less accurate. Compared to other fair and interpretable classifiers, our method is able to find rule sets that meet stricter notions of fairness with a modest trade-off in accuracy.
On the Optimality of Nuclear-norm-based Matrix Completion for Problems with Smooth Non-linear Structure
http://jmlr.org/papers/v24/22-0850.html
http://jmlr.org/papers/volume24/22-0850/22-0850.pdf
2023Yunhua Xiang, Tianyu Zhang, Xu Wang, Ali Shojaie, Noah Simon
Nuclear-norm-based matrix completion was originally developed for imputing missing entries in low rank, or approximately low rank matrices. However, it has proven widely effective in many problems where there is no reason to assume low-dimensional linear structure in the underlying matrix, as would be imposed by rank constraints. In this manuscript we show that nuclear-norm-based matrix completion attains within a log factor of the minimax rate for estimating the mean structure of matrices that are not necessarily low-rank, but lie in a low-dimensional non-linear manifold, when observations are missing completely at random. In particular, we give upper bounds on the rate of convergence as a function of the number of rows, columns, and observed entries in the matrix, as well as the smoothness and dimension of the non-linear embedding. We additionally give a minimax lower bound: This lower bound agrees with our upper bound (up to a logarithmic factor), which shows that nuclear-norm penalization is (up to log terms) minimax rate optimal for these problems.
Autoregressive Networks
http://jmlr.org/papers/v24/22-0845.html
http://jmlr.org/papers/volume24/22-0845/22-0845.pdf
2023Binyan Jiang, Jialiang Li, Qiwei Yao
We propose a first-order autoregressive (i.e. AR(1)) model for dynamic network processes in which edges change over time while nodes remain unchanged. The model depicts the dynamic changes explicitly. It also facilitates simple and efficient statistical inference methods including a permutation test for diagnostic checking for the fitted network models. The proposed model can be applied to the network processes with various underlying structures but with independent edges. As an illustration, an AR(1) stochastic block model has been investigated in depth, which characterizes the latent communities by the transition probabilities over time. This leads to a new and more effective spectral clustering algorithm for identifying the latent communities. We have derived a finite sample condition under which the perfect recovery of the community structure can be achieved by the newly defined spectral clustering algorithm. Furthermore the inference for a change point is incorporated into the AR(1) stochastic block model to cater for possible structure changes. We have derived the explicit error rates for the maximum likelihood estimator of the change-point. Application with three real data sets illustrates both relevance and usefulness of the proposed AR(1) models and the associate inference methods.
Merlion: End-to-End Machine Learning for Time Series
http://jmlr.org/papers/v24/22-0809.html
http://jmlr.org/papers/volume24/22-0809/22-0809.pdf
2023Aadyot Bhatnagar, Paul Kassianik, Chenghao Liu, Tian Lan, Wenzhuo Yang, Rowan Cassius, Doyen Sahoo, Devansh Arpit, Sri Subramanian, Gerald Woo, Amrita Saha, Arun Kumar Jagota, Gokulakrishnan Gopalakrishnan, Manpreet Singh, K C Krithika, Sukumar Maddineni, Daeki Cho, Bo Zong, Yingbo Zhou, Caiming Xiong, Silvio Savarese, Steven Hoi, Huan Wang
We introduce Merlion, an open-source machine learning library for time series. It features a unified interface for many commonly used models and datasets for forecasting and anomaly detection on both univariate and multivariate time series, along with standard pre/post-processing layers. It has several modules to improve ease-of-use, including a no-code visual dashboard, anomaly score calibration to improve interpetability, AutoML for hyperparameter tuning and model selection, and model ensembling. Merlion also provides an evaluation framework that simulates the live deployment of a model in production, and a distributed computing backend to run time series models at industrial scale. This library aims to provide engineers and researchers a one-stop solution to rapidly develop models for their specific time series needs and benchmark them across multiple datasets.
Limits of Dense Simplicial Complexes
http://jmlr.org/papers/v24/22-0808.html
http://jmlr.org/papers/volume24/22-0808/22-0808.pdf
2023T. Mitchell Roddenberry, Santiago Segarra
We develop a theory of limits for sequences of dense abstract simplicial complexes, where a sequence is considered convergent if its homomorphism densities converge. The limiting objects are represented by stacks of measurable $[0,1]$-valued functions on unit cubes of increasing dimension, each corresponding to a dimension of the abstract simplicial complex. We show that convergence in homomorphism density implies convergence in a cut-metric, and vice versa, as well as showing that simplicial complexes sampled from the limit objects closely resemble its structure. Applying this framework, we also partially characterize the convergence of nonuniform hypergraphs.
RankSEG: A Consistent Ranking-based Framework for Segmentation
http://jmlr.org/papers/v24/22-0712.html
http://jmlr.org/papers/volume24/22-0712/22-0712.pdf
2023Ben Dai, Chunlin Li
Segmentation has emerged as a fundamental field of computer vision and natural language processing, which assigns a label to every pixel/feature to extract regions of interest from an image/text. To evaluate the performance of segmentation, the Dice and IoU metrics are used to measure the degree of overlap between the ground truth and the predicted segmentation. In this paper, we establish a theoretical foundation of segmentation with respect to the Dice/IoU metrics, including the Bayes rule and Dice-/IoU-calibration, analogous to classification-calibration or Fisher consistency in classification. We prove that the existing thresholding-based framework with most operating losses are not consistent with respect to the Dice/IoU metrics, and thus may lead to a suboptimal solution. To address this pitfall, we propose a novel consistent ranking-based framework, namely RankDice/RankIoU, inspired by plug-in rules of the Bayes segmentation rule. Three numerical algorithms with GPU parallel execution are developed to implement the proposed framework in large-scale and high-dimensional segmentation. We study statistical properties of the proposed framework. We show it is Dice-/IoU-calibrated, and its excess risk bounds and the rate of convergence are also provided. The numerical effectiveness of RankDice/mRankDice is demonstrated in various simulated examples and Fine-annotated CityScapes, Pascal VOC and Kvasir-SEG datasets with state-of-the-art deep learning architectures. Python module and source code are available on Github at (https://github.com/statmlben/rankseg).
Conditional Distribution Function Estimation Using Neural Networks for Censored and Uncensored Data
http://jmlr.org/papers/v24/22-0657.html
http://jmlr.org/papers/volume24/22-0657/22-0657.pdf
2023Bingqing Hu, Bin Nan
Most work in neural networks focuses on estimating the conditional mean of a continuous response variable given a set of covariates. In this article, we consider estimating the conditional distribution function using neural networks for both censored and uncensored data. The algorithm is built upon the data structure particularly constructed for the Cox regression with time-dependent covariates. Without imposing any model assumptions, we consider a loss function that is based on the full likelihood where the conditional hazard function is the only unknown nonparametric parameter, for which unconstrained optimization methods can be applied. Through simulation studies, we show that the proposed method possesses desirable performance, whereas the partial likelihood method and the traditional neural networks with $L_2$ loss yields biased estimates when model assumptions are violated. We further illustrate the proposed method with several real-world data sets.
Single Timescale Actor-Critic Method to Solve the Linear Quadratic Regulator with Convergence Guarantees
http://jmlr.org/papers/v24/22-0644.html
http://jmlr.org/papers/volume24/22-0644/22-0644.pdf
2023Mo Zhou, Jianfeng Lu
We propose a single timescale actor-critic algorithm to solve the linear quadratic regulator (LQR) problem. A least squares temporal difference (LSTD) method is applied to the critic and a natural policy gradient method is used for the actor. We give a proof of convergence with sample complexity $\mathcal{O}(\varepsilon^{-1} \log(\varepsilon^{-1})^2)$. The method in the proof is applicable to general single timescale bilevel optimization problems. We also numerically validate our theoretical results on the convergence.
Multi-source Learning via Completion of Block-wise Overlapping Noisy Matrices
http://jmlr.org/papers/v24/22-0642.html
http://jmlr.org/papers/volume24/22-0642/22-0642.pdf
2023Doudou Zhou, Tianxi Cai, Junwei Lu
Electronic healthcare records (EHR) provide a rich resource for healthcare research. An important problem for the efficient utilization of the EHR data is the representation of the EHR features, which include the unstructured clinical narratives and the structured codified data. Matrix factorization-based embeddings trained using the summary-level co-occurrence statistics of EHR data have provided a promising solution for feature representation while preserving patients' privacy. However, such methods do not work well with multi-source data when these sources have overlapping but non-identical features. To accommodate multi-sources learning, we propose a novel word embedding generative model. To obtain multi-source embeddings, we design an efficient Block-wise Overlapping Noisy Matrix Integration (BONMI) algorithm to aggregate the multi-source pointwise mutual information matrices optimally with a theoretical guarantee. Our algorithm can also be applied to other multi-source data integration problems with a similar data structure. A by-product of BONMI is the contribution to the field of matrix completion by considering the missing mechanism other than the entry-wise independent missing. We show that the entry-wise missing assumption, despite its prevalence in the works of matrix completion, is not necessary to guarantee recovery. We prove the statistical rate of our estimator, which is comparable to the rate under independent missingness. Simulation studies show that BONMI performs well under a variety of configurations. We further illustrate the utility of BONMI by integrating multi-lingual multi-source medical text and EHR data to perform two tasks: (i) co-training semantic embeddings for medical concepts in both English and Chinese and (ii) the translation between English and Chinese medical concepts. Our method shows an advantage over existing methods.
A Unified Framework for Factorizing Distributional Value Functions for Multi-Agent Reinforcement Learning
http://jmlr.org/papers/v24/22-0630.html
http://jmlr.org/papers/volume24/22-0630/22-0630.pdf
2023Wei-Fang Sun, Cheng-Kuang Lee, Simon See, Chun-Yi Lee
In fully cooperative multi-agent reinforcement learning (MARL) settings, environments are highly stochastic due to the partial observability of each agent and the continuously changing policies of other agents. To address the above issues, we proposed a unified framework, called DFAC, for integrating distributional RL with value function factorization methods. This framework generalizes expected value function factorization methods to enable the factorization of return distributions. To validate DFAC, we first demonstrate its ability to factorize the value functions of a simple matrix game with stochastic rewards. Then, we perform experiments on all Super Hard maps of the StarCraft Multi-Agent Challenge and six self-designed Ultra Hard maps, showing that DFAC is able to outperform a number of baselines.
Functional L-Optimality Subsampling for Functional Generalized Linear Models with Massive Data
http://jmlr.org/papers/v24/22-0614.html
http://jmlr.org/papers/volume24/22-0614/22-0614.pdf
2023Hua Liu, Jinhong You, Jiguo Cao
Massive data bring the big challenges of memory and computation for analysis. These challenges can be tackled by taking subsamples from the full data as a surrogate. For functional data, it is common to collect multiple measurements over their domains, which require even more memory and computation time when the sample size is large. The computation would be much more intensive when statistical inference is required through bootstrap samples. Motivated by analyzing large-scale kidney transplant data, we propose an optimal subsampling method based on the functional L-optimality criterion for functional generalized linear models. To the best of our knowledge, this is the first attempt to propose a subsampling method for functional data analysis. The asymptotic properties of the resultant estimators are also established. The analysis results from extensive simulation studies and from the kidney transplant data show that the functional L-optimality subsampling (FLoS) method is much better than the uniform subsampling approach and can well approximate the results based on the full data while dramatically reducing the computation time and memory.
Adaptation Augmented Model-based Policy Optimization
http://jmlr.org/papers/v24/22-0606.html
http://jmlr.org/papers/volume24/22-0606/22-0606.pdf
2023Jian Shen, Hang Lai, Minghuan Liu, Han Zhao, Yong Yu, Weinan Zhang
Compared to model-free reinforcement learning (RL), model-based RL is often more sample efficient by leveraging a learned dynamics model to help decision making. However, the learned model is usually not perfectly accurate and the error will compound in multi-step predictions, which can lead to poor asymptotic performance. In this paper, we first derive an upper bound of the return discrepancy between the real dynamics and the learned model, which reveals the fundamental problem of distribution shift between simulated data and real data. Inspired by the theoretical analysis, we propose an adaptation augmented model-based policy optimization (AMPO) framework to address the distribution shift problem from the perspectives of feature learning and instance re-weighting, respectively. Specifically, the feature-based variant, namely FAMPO, introduces unsupervised model adaptation to minimize the integral probability metric (IPM) between feature distributions from real and simulated data, while the instance-based variant, termed as IAMPO, utilizes importance sampling to re-weight the real samples used to train the model. Besides model learning, we also investigate how to improve policy optimization in the model usage phase by selecting simulated samples with different probability according to their uncertainty. Extensive experiments on challenging continuous control tasks show that FAMPO and IAMPO, coupled with our model usage technique, achieves superior performance against baselines, which demonstrates the effectiveness of the proposed methods.
GANs as Gradient Flows that Converge
http://jmlr.org/papers/v24/22-0583.html
http://jmlr.org/papers/volume24/22-0583/22-0583.pdf
2023Yu-Jui Huang, Yuchong Zhang
This paper approaches the unsupervised learning problem by gradient descent in the space of probability density functions. A main result shows that along the gradient flow induced by a distribution-dependent ordinary differential equation (ODE), the unknown data distribution emerges as the long-time limit. That is, one can uncover the data distribution by simulating the distribution-dependent ODE. Intriguingly, the simulation of the ODE is shown equivalent to the training of generative adversarial networks (GANs). This equivalence provides a new "cooperative" view of GANs and, more importantly, sheds new light on the divergence of GANs. In particular, it reveals that the GAN algorithm implicitly minimizes the mean squared error (MSE) between two sets of samples, and this MSE fitting alone can cause GANs to diverge. To construct a solution to the distribution-dependent ODE, we first show that the associated nonlinear Fokker-Planck equation has a unique weak solution, by the Crandall-Liggett theorem for differential equations in Banach spaces. Based on this solution to the Fokker-Planck equation, we construct a unique solution to the ODE, using Trevisan's superposition principle. The convergence of the induced gradient flow to the data distribution is obtained by analyzing the Fokker-Planck equation.
Random Forests for Change Point Detection
http://jmlr.org/papers/v24/22-0512.html
http://jmlr.org/papers/volume24/22-0512/22-0512.pdf
2023Malte Londschien, Peter Bühlmann, Solt Kovács
We propose a novel multivariate nonparametric multiple change point detection method using classifiers. We construct a classifier log-likelihood ratio that uses class probability predictions to compare different change point configurations. We propose a computationally feasible search method that is particularly well suited for random forests, denoted by changeforest. However, the method can be paired with any classifier that yields class probability predictions, which we illustrate by also using a $k$-nearest neighbor classifier. We prove that it consistently locates change points in single change point settings when paired with a consistent classifier. Our proposed method changeforest achieves improved empirical performance in an extensive simulation study compared to existing multivariate nonparametric change point detection methods. An efficient implementation of our method is made available for R, Python, and Rust users in the changeforest software package.
Least Squares Model Averaging for Distributed Data
http://jmlr.org/papers/v24/22-0511.html
http://jmlr.org/papers/volume24/22-0511/22-0511.pdf
2023Haili Zhang, Zhaobo Liu, Guohua Zou
Divide and conquer algorithm is a common strategy applied in big data. Model averaging has the natural divide-and-conquer feature, but its theory has not been developed in big data scenarios. The goal of this paper is to fill this gap. We propose two divide-and-conquer-type model averaging estimators for linear models with distributed data. Under some regularity conditions, we show that the weights from Mallows model averaging criterion converge in L2 to the theoretically optimal weights minimizing the risk of the model averaging estimator. We also give the bounds of the in-sample and out-of-sample mean squared errors and prove the asymptotic optimality for the proposed model averaging estimators. Our conclusions hold even when the dimensions and the number of candidate models are divergent. Simulation results and a real airline data analysis illustrate that the proposed model averaging methods perform better than the commonly used model selection and model averaging methods in distributed data cases. Our approaches contribute to model averaging theory in distributed data and parallel computations, and can be applied in big data analysis to save time and reduce the computational burden.
An Empirical Investigation of the Role of Pre-training in Lifelong Learning
http://jmlr.org/papers/v24/22-0496.html
http://jmlr.org/papers/volume24/22-0496/22-0496.pdf
2023Sanket Vaibhav Mehta, Darshan Patil, Sarath Chandar, Emma Strubell
The lifelong learning paradigm in machine learning is an attractive alternative to the more prominent isolated learning scheme not only due to its resemblance to biological learning but also its potential to reduce energy waste by obviating excessive model re-training. A key challenge to this paradigm is the phenomenon of catastrophic forgetting. With the increasing popularity and success of pre-trained models in machine learning, we pose the question: What role does pre-training play in lifelong learning, specifically with respect to catastrophic forgetting? We investigate existing methods in the context of large, pre-trained models and evaluate their performance on a variety of text and image classification tasks, including a large-scale study using a novel data set of 15 diverse NLP tasks. Across all settings, we observe that generic pre-training implicitly alleviates the effects of catastrophic forgetting when learning multiple tasks sequentially compared to randomly initialized models. We then further investigate why pre-training alleviates forgetting in this setting. We study this phenomenon by analyzing the loss landscape, finding that pre-trained weights appear to ease forgetting by leading to wider minima. Based on this insight, we propose jointly optimizing for current task loss and loss basin sharpness to explicitly encourage wider basins during sequential fine-tuning. We show that this optimization approach outperforms several state-of-the-art task-sequential continual learning algorithms across multiple settings, occasionally even without retaining a memory that scales in size with the number of tasks.
Polynomial-Time Algorithms for Counting and Sampling Markov Equivalent DAGs with Applications
http://jmlr.org/papers/v24/22-0495.html
http://jmlr.org/papers/volume24/22-0495/22-0495.pdf
2023Marcel Wienöbst, Max Bannach, Maciej Liśkiewicz
Counting and sampling directed acyclic graphs from a Markov equivalence class are fundamental tasks in graphical causal analysis. In this paper we show that these tasks can be performed in polynomial time, solving a long-standing open problem in this area. Our algorithms are effective and easily implementable. As we show in experiments, these breakthroughs make thought-to-be-infeasible strategies in active learning of causal structures and causal effect identification with regard to a Markov equivalence class practically applicable.
An Inexact Augmented Lagrangian Algorithm for Training Leaky ReLU Neural Network with Group Sparsity
http://jmlr.org/papers/v24/22-0491.html
http://jmlr.org/papers/volume24/22-0491/22-0491.pdf
2023Wei Liu, Xin Liu, Xiaojun Chen
The leaky ReLU network with a group sparse regularization term has been widely used in the recent years. However, training such network yields a nonsmooth nonconvex optimization problem and there exists a lack of approaches to compute a stationary point deterministically. In this paper, we first resolve the multi-layer composite term in the original optimization problem by introducing auxiliary variables and additional constraints. We show the new model has a nonempty and bounded solution set and its feasible set satisfies the Mangasarian-Fromovitz constraint qualification. Moreover, we show the relationship between the new model and the original problem. Remarkably, we propose an inexact augmented Lagrangian algorithm for solving the new model, and show the convergence of the algorithm to a KKT point. Numerical experiments demonstrate that our algorithm is more efficient for training sparse leaky ReLU neural networks than some well-known algorithms.
Entropic Fictitious Play for Mean Field Optimization Problem
http://jmlr.org/papers/v24/22-0411.html
http://jmlr.org/papers/volume24/22-0411/22-0411.pdf
2023Fan Chen, Zhenjie Ren, Songbo Wang
We study two-layer neural networks in the mean field limit, where the number of neurons tends to infinity. In this regime, the optimization over the neuron parameters becomes the optimization over the probability measures, and by adding an entropic regularizer, the minimizer of the problem is identified as a fixed point. We propose a novel training algorithm named entropic fictitious play, inspired by the classical fictitious play in game theory for learning Nash equilibriums, to recover this fixed point, and the algorithm exhibits a two-loop iteration structure. Exponential convergence is proved in this paper and we also verify our theoretical results by simple numerical examples.
GFlowNet Foundations
http://jmlr.org/papers/v24/22-0364.html
http://jmlr.org/papers/volume24/22-0364/22-0364.pdf
2023Yoshua Bengio, Salem Lahlou, Tristan Deleu, Edward J. Hu, Mo Tiwari, Emmanuel Bengio
Generative Flow Networks (GFlowNets) have been introduced as a method to sample a diverse set of candidates in an active learning context, with a training objective that makes them approximately sample in proportion to a given reward function. In this paper, we show a number of additional theoretical properties of GFlowNets, including a new local and efficient training objective called detailed balance for the analogy with MCMC. GFlowNets can be used to estimate joint probability distributions and the corresponding marginal distributions where some variables are unspecified and, of particular interest, can represent distributions over composite objects like sets and graphs. GFlowNets amortize the work typically done by computationally expensive MCMC methods in a single but trained generative pass. They could also be used to estimate partition functions and free energies, conditional probabilities of supersets (supergraphs) given a subset (subgraph), as well as marginal distributions over all supersets (supergraphs) of a given set (graph). We introduce variations enabling the estimation of entropy and mutual information, continuous actions and modular energy functions.
LibMTL: A Python Library for Deep Multi-Task Learning
http://jmlr.org/papers/v24/22-0347.html
http://jmlr.org/papers/volume24/22-0347/22-0347.pdf
2023Baijiong Lin, Yu Zhang
This paper presents LibMTL, an open-source Python library built on PyTorch, which provides a unified, comprehensive, reproducible, and extensible implementation framework for Multi-Task Learning (MTL). LibMTL considers different settings and approaches in MTL, and it supports a large number of state-of-the-art MTL methods, including 13 optimization strategies and 8 architectures. Moreover, the modular design in LibMTL makes it easy to use and well-extensible, thus users can easily and fast develop new MTL methods, compare with existing MTL methods fairly, or apply MTL algorithms to real-world applications with the support of LibMTL. The source code and detailed documentations of LibMTL are available at https://github.com/median-research-group/LibMTL and https://libmtl.readthedocs.io, respectively.
Minimax Risk Classifiers with 0-1 Loss
http://jmlr.org/papers/v24/22-0339.html
http://jmlr.org/papers/volume24/22-0339/22-0339.pdf
2023Santiago Mazuelas, Mauricio Romero, Peter Grunwald
Supervised classification techniques use training samples to learn a classification rule with small expected 0-1 loss (error probability). Conventional methods enable tractable learning and provide out-of-sample generalization by using surrogate losses instead of the 0-1 loss and considering specific families of rules (hypothesis classes). This paper presents minimax risk classifiers (MRCs) that minimize the worst-case 0-1 loss with respect to uncertainty sets of distributions that can include the underlying distribution, with a tunable confidence. We show that MRCs can provide tight performance guarantees at learning and are strongly universally consistent using feature mappings given by characteristic kernels. The paper also proposes efficient optimization techniques for MRC learning and shows that the methods presented can provide accurate classification together with tight performance guarantees in practice.
Augmented Sparsifiers for Generalized Hypergraph Cuts
http://jmlr.org/papers/v24/22-0305.html
http://jmlr.org/papers/volume24/22-0305/22-0305.pdf
2023Nate Veldt, Austin R. Benson, Jon Kleinberg
Hypergraph generalizations of many graph cut problems and algorithms have recently been introduced to better model data and systems characterized by multiway relationships. Recent work in machine learning and theoretical computer science uses a generalized cut function for a hypergraph $\mathcal{H} = (V,\mathcal{E})$ that associates each hyperedge $e \in \mathcal{E}$ with a splitting function ${\bf w}_e$, which assigns a penalty to each way of separating the nodes of $e$. When each ${\bf w}_e$ satisfies ${\bf w}_e(S) = g(\lvert S \rvert)$ for some concave function $g$, previous work has shown how to reduce the generalized hypergraph cut problem to a directed graph cut problem, although the resulting graph may be very dense. We introduce a framework for sparsifying hypergraph-to-graph reductions, where the hypergraph cut function is $(1+\varepsilon)$-approximated by a cut on a directed graph. For $\varepsilon > 0$ we need at most $O(\varepsilon^{-1}|e| \log |e|)$ edges to reduce any hyperedge $e$, while only $O(|e| \varepsilon^{-1/2} \log \log \frac{1}{\varepsilon})$ edges are needed to approximate the clique expansion, a widely used heuristic in hypergraph clustering. Our framework leads to improved results for solving cut problems in co-occurrence graphs, decomposable submodular function minimization problems, and localized hypergraph clustering problems.
Non-stationary Online Learning with Memory and Non-stochastic Control
http://jmlr.org/papers/v24/22-0218.html
http://jmlr.org/papers/volume24/22-0218/22-0218.pdf
2023Peng Zhao, Yu-Hu Yan, Yu-Xiang Wang, Zhi-Hua Zhou
We study the problem of Online Convex Optimization (OCO) with memory, which allows loss functions to depend on past decisions and thus captures temporal effects of learning problems. In this paper, we introduce dynamic policy regret as the performance measure to design algorithms robust to non-stationary environments, which competes algorithms' decisions with a sequence of changing comparators. We propose a novel algorithm for OCO with memory that provably enjoys an optimal dynamic policy regret in terms of time horizon, non-stationarity measure, and memory length. The key technical challenge is how to control the switching cost, the cumulative movements of player's decisions, which is neatly addressed by a novel switching-cost-aware online ensemble approach equipped with a new meta-base decomposition of dynamic policy regret and a careful design of meta-learner and base-learner that explicitly regularizes the switching cost. The results are further applied to tackle non-stationarity in online non-stochastic control (Agarwal et al., 2019), i.e., controlling a linear dynamical system with adversarial disturbance and convex cost functions. We derive a novel gradient-based controller with dynamic policy regret guarantees, which is the first controller provably competitive to a sequence of changing policies for online non-stochastic control.
L0Learn: A Scalable Package for Sparse Learning using L0 Regularization
http://jmlr.org/papers/v24/22-0189.html
http://jmlr.org/papers/volume24/22-0189/22-0189.pdf
2023Hussein Hazimeh, Rahul Mazumder, Tim Nonet
We present L0Learn: an open-source package for sparse linear regression and classification using $\ell_0$ regularization. L0Learn implements scalable, approximate algorithms, based on coordinate descent and local combinatorial optimization. The package is built using C++ and has user-friendly R and Python interfaces. L0Learn can address problems with millions of features, achieving competitive run times and statistical performance with state-of-the-art sparse learning packages. L0Learn is available on both CRAN and GitHub.
Buffered Asynchronous SGD for Byzantine Learning
http://jmlr.org/papers/v24/22-0184.html
http://jmlr.org/papers/volume24/22-0184/22-0184.pdf
2023Yi-Rui Yang, Wu-Jun Li
Distributed learning has become a hot research topic due to its wide application in cluster-based large-scale learning, federated learning, edge computing, and so on. Most traditional distributed learning methods typically assume no failure or attack. However, many unexpected cases, such as communication failure and even malicious attack, may happen in real applications. Hence, Byzantine learning (BL), which refers to distributed learning with failure or attack, has recently attracted much attention. Most existing BL methods are synchronous, which are impractical in some applications due to heterogeneous or offline workers. In these cases, asynchronous BL (ABL) is usually preferred. In this paper, we propose a novel method, called buffered asynchronous stochastic gradient descent (BASGD), for ABL. To the best of our knowledge, BASGD is the first ABL method that can resist non-omniscient attacks without storing any instances on the server. Furthermore, we also propose an improved variant of BASGD, called BASGD with momentum (BASGDm), by introducing local momentum into BASGD. Compared with those methods which need to store instances on server, BASGD and BASGDm have a wider scope of application. Both BASGD and BASGDm are compatible with various aggregation rules. Moreover, both BASGD and BASGDm are proved to be convergent and able to resist failure or attack. Empirical results show that our methods significantly outperform existing ABL baselines when there exists failure or attack on workers.
A Non-parametric View of FedAvg and FedProx:Beyond Stationary Points
http://jmlr.org/papers/v24/22-0153.html
http://jmlr.org/papers/volume24/22-0153/22-0153.pdf
2023Lili Su, Jiaming Xu, Pengkun Yang
Federated Learning (FL) is a promising decentralized learning framework and has great potentials in privacy preservation and in lowering the computation load at the cloud. Recent work showed that FedAvg and FedProx -- the two widely-adopted FL algorithms -- fail to reach the stationary points of the global optimization objective even for homogeneous linear regression problems. Further, it is concerned that the common model learned might not generalize well locally at all in the presence of heterogeneity. In this paper, we analyze the convergence and statistical efficiency of FedAvg and FedProx, addressing the above two concerns. Our analysis is based on the standard non-parametric regression in a reproducing kernel Hilbert space (RKHS), and allows for heterogeneous local data distributions and unbalanced local datasets. We prove that the estimation errors, measured in either the empirical norm or the RKHS norm, decay with a rate of $1/t$ in general and exponentially for finite-rank kernels. In certain heterogeneous settings, these upper bounds also imply that both FedAvg and FedProx achieve the optimal error rate. To further analytically quantify the impact of the heterogeneity at each client, we propose and characterize a novel notion-federation gain, defined as the reduction of the estimation error for a client to join the FL. We discover that when the data heterogeneity is moderate, a client with limited local data can benefit from a common model with a large federation gain. Two new insights introduced by considering the statistical aspect are: (1) requiring the standard bounded dissimilarity is pessimistic for the convergence analysis of FedAvg and FedProx; (2) despite inconsistency of stationary points, their limiting points are unbiased estimators of the underlying truth. Numerical experiments further corroborate our theoretical findings.
Multiplayer Performative Prediction: Learning in Decision-Dependent Games
http://jmlr.org/papers/v24/22-0131.html
http://jmlr.org/papers/volume24/22-0131/22-0131.pdf
2023Adhyyan Narang, Evan Faulkner, Dmitriy Drusvyatskiy, Maryam Fazel, Lillian J. Ratliff
Learning problems commonly exhibit an interesting feedback mechanism wherein the population data reacts to competing decision makers' actions. This paper formulates a new game theoretic framework for this phenomenon, called multi-player performative prediction. We focus on two distinct solution concepts, namely (i) performatively stable equilibria and (ii) Nash equilibria of the game. The latter equilibria are arguably more informative, but are generally computationally difficult to find since they are solutions of non-monotone games. We show that under mild assumptions, the performatively stable equilibria can be found efficiently by a variety of algorithms, including repeated retraining and the repeated (stochastic) gradient method. We then establish transparent sufficient conditions for strong monotonicity of the game and use them to develop algorithms for finding Nash equilibria. We investigate derivative free methods and adaptive gradient algorithms wherein each player alternates between learning a parametric description of their distribution and gradient steps on the empirical risk. Synthetic and semi-synthetic numerical experiments illustrate the results.
Variational Inverting Network for Statistical Inverse Problems of Partial Differential Equations
http://jmlr.org/papers/v24/22-0006.html
http://jmlr.org/papers/volume24/22-0006/22-0006.pdf
2023Junxiong Jia, Yanni Wu, Peijun Li, Deyu Meng
To quantify uncertainties in inverse problems of partial differential equations (PDEs), we formulate them into statistical inference problems using Bayes' formula. Recently, well-justified infinite-dimensional Bayesian analysis methods have been developed to construct dimension-independent algorithms. However, there are three challenges for these infinite-dimensional Bayesian methods: prior measures usually act as regularizers and are not able to incorporate prior information efficiently; complex noises, such as more practical non-i.i.d. distributed noises, are rarely considered; and time-consuming forward PDE solvers are needed to estimate posterior statistical quantities. To address these issues, an infinite-dimensional inference framework has been proposed based on the infinite-dimensional variational inference method and deep generative models. Specifically, by introducing some measure equivalence assumptions, we derive the evidence lower bound in the infinite-dimensional setting and provide possible parametric strategies that yield a general inference framework called the Variational Inverting Network (VINet). This inference framework can encode prior and noise information from learning examples. In addition, relying on the power of deep neural networks, the posterior mean and variance can be efficiently and explicitly generated in the inference stage. In numerical experiments, we design specific network structures that yield a computable VINet from the general inference framework. Numerical examples of linear inverse problems of an elliptic equation and the Helmholtz equation are presented to illustrate the effectiveness of the proposed inference framework.
Model-based Causal Discovery for Zero-Inflated Count Data
http://jmlr.org/papers/v24/21-1476.html
http://jmlr.org/papers/volume24/21-1476/21-1476.pdf
2023Junsouk Choi, Yang Ni
Zero-inflated count data arise in a wide range of scientific areas such as social science, biology, and genomics. Very few causal discovery approaches can adequately account for excessive zeros as well as various features of multivariate count data such as overdispersion. In this paper, we propose a new zero-inflated generalized hypergeometric directed acyclic graph (ZiG-DAG) model for inference of causal structure from purely observational zero-inflated count data. The proposed ZiG-DAGs exploit a broad family of generalized hypergeometric probability distributions and are useful for modeling various types of zero-inflated count data with great flexibility. In addition, ZiG-DAGs allow for both linear and nonlinear causal relationships. We prove that the causal structure is identifiable for the proposed ZiG-DAGs via a general proof technique for count data, which is applicable beyond the proposed model for investigating causal identifiability. Score-based algorithms are developed for causal structure learning. Extensive synthetic experiments as well as a real dataset with known ground truth demonstrate the superior performance of the proposed method against state-of-the-art alternative methods in discovering causal structure from observational zero-inflated count data. An application of reverse-engineering a gene regulatory network from a single-cell RNA-sequencing dataset illustrates the utility of ZiG-DAGs in practice.
Q-Learning for MDPs with General Spaces: Convergence and Near Optimality via Quantization under Weak Continuity
http://jmlr.org/papers/v24/21-1457.html
http://jmlr.org/papers/volume24/21-1457/21-1457.pdf
2023Ali Kara, Naci Saldi, Serdar Yüksel
Reinforcement learning algorithms often require finiteness of state and action spaces in Markov decision processes (MDPs) (also called controlled Markov chains) and various efforts have been made in the literature towards the applicability of such algorithms for continuous state and action spaces. In this paper, we show that under very mild regularity conditions (in particular, involving only weak continuity of the transition kernel of an MDP), Q-learning for standard Borel MDPs via quantization of states and actions (called Quantized Q-Learning) converges to a limit, and furthermore this limit satisfies an optimality equation which leads to near optimality with either explicit performance bounds or which are guaranteed to be asymptotically optimal. Our approach builds on (i) viewing quantization as a measurement kernel and thus a quantized MDP as a partially observed Markov decision process (POMDP), (ii) utilizing near optimality and convergence results of Q-learning for POMDPs, and (iii) finally, near-optimality of finite state model approximations for MDPs with weakly continuous kernels which we show to correspond to the fixed point of the constructed POMDP. Thus, our paper presents a very general convergence and approximation result for the applicability of Q-learning for continuous MDPs.
CodaLab Competitions: An Open Source Platform to Organize Scientific Challenges
http://jmlr.org/papers/v24/21-1436.html
http://jmlr.org/papers/volume24/21-1436/21-1436.pdf
2023Adrien Pavao, Isabelle Guyon, Anne-Catherine Letournel, Dinh-Tuan Tran, Xavier Baro, Hugo Jair Escalante, Sergio Escalera, Tyler Thomas, Zhen Xu
CodaLab Competitions is an open source web platform designed to help data scientists and research teams to crowd-source the resolution of machine learning problems through the organization of competitions, also called challenges or contests. CodaLab Competitions provides useful features such as multiple phases, results and code submissions, multi-score leaderboards, and jobs running inside Docker containers. The platform is very flexible and can handle large scale experiments, by allowing organizers to upload large datasets and provide their own CPU or GPU compute workers.
Contrasting Identifying Assumptions of Average Causal Effects: Robustness and Semiparametric Efficiency
http://jmlr.org/papers/v24/21-1392.html
http://jmlr.org/papers/volume24/21-1392/21-1392.pdf
2023Tetiana Gorbach, Xavier de Luna, Juha Karvanen, Ingeborg Waernbaum
Semiparametric inference on average causal effects from observational data is based on assumptions yielding identification of the effects. In practice, several distinct identifying assumptions may be plausible; an analyst has to make a delicate choice between these models. In this paper, we study three identifying assumptions based on the potential outcome framework: the back-door assumption, which uses pre-treatment covariates, the front-door assumption, which uses mediators, and the two-door assumption using pre-treatment covariates and mediators simultaneously. We provide the efficient influence functions and the corresponding semiparametric efficiency bounds that hold under these assumptions, and their combinations. We demonstrate that neither of the identification models provides uniformly the most efficient estimation and give conditions under which some bounds are lower than others. We show when semiparametric estimating equation estimators based on influence functions attain the bounds, and study the robustness of the estimators to misspecification of the nuisance models. The theory is complemented with simulation experiments on the finite sample behavior of the estimators. The results obtained are relevant for an analyst facing a choice between several plausible identifying assumptions and corresponding estimators. Our results show that this choice implies a trade-off between efficiency and robustness to misspecification of the nuisance models.
Variational Gibbs Inference for Statistical Model Estimation from Incomplete Data
http://jmlr.org/papers/v24/21-1373.html
http://jmlr.org/papers/volume24/21-1373/21-1373.pdf
2023Vaidotas Simkus, Benjamin Rhodes, Michael U. Gutmann
Statistical models are central to machine learning with broad applicability across a range of downstream tasks. The models are controlled by free parameters that are typically estimated from data by maximum-likelihood estimation or approximations thereof. However, when faced with real-world data sets many of the models run into a critical issue: they are formulated in terms of fully-observed data, whereas in practice the data sets are plagued with missing data. The theory of statistical model estimation from incomplete data is conceptually similar to the estimation of latent-variable models, where powerful tools such as variational inference (VI) exist. However, in contrast to standard latent-variable models, parameter estimation with incomplete data often requires estimating exponentially-many conditional distributions of the missing variables, hence making standard VI methods intractable. We address this gap by introducing variational Gibbs inference (VGI), a new general-purpose method to estimate the parameters of statistical models from incomplete data. We validate VGI on a set of synthetic and real-world estimation tasks, estimating important machine learning models such as variational autoencoders and normalising flows from incomplete data. The proposed method, whilst general-purpose, achieves competitive or better performance than existing model-specific estimation methods.
Clustering and Structural Robustness in Causal Diagrams
http://jmlr.org/papers/v24/21-1322.html
http://jmlr.org/papers/volume24/21-1322/21-1322.pdf
2023Santtu Tikka, Jouni Helske, Juha Karvanen
Graphs are commonly used to represent and visualize causal relations. For a small number of variables, this approach provides a succinct and clear view of the scenario at hand. As the number of variables under study increases, the graphical approach may become impractical, and the clarity of the representation is lost. Clustering of variables is a natural way to reduce the size of the causal diagram, but it may erroneously change the essential properties of the causal relations if implemented arbitrarily. We define a specific type of cluster, called transit cluster, that is guaranteed to preserve the identifiability properties of causal effects under certain conditions. We provide a sound and complete algorithm for finding all transit clusters in a given graph and demonstrate how clustering can simplify the identification of causal effects. We also study the inverse problem, where one starts with a clustered graph and looks for extended graphs where the identifiability properties of causal effects remain unchanged. We show that this kind of structural robustness is closely related to transit clusters.
MMD Aggregated Two-Sample Test
http://jmlr.org/papers/v24/21-1289.html
http://jmlr.org/papers/volume24/21-1289/21-1289.pdf
2023Antonin Schrab, Ilmun Kim, Mélisande Albert, Béatrice Laurent, Benjamin Guedj, Arthur Gretton
We propose two novel nonparametric two-sample kernel tests based on the Maximum Mean Discrepancy (MMD). First, for a fixed kernel, we construct an MMD test using either permutations or a wild bootstrap, two popular numerical procedures to determine the test threshold. We prove that this test controls the probability of type I error non-asymptotically. Hence, it can be used reliably even in settings with small sample sizes as it remains well-calibrated, which differs from previous MMD tests which only guarantee correct test level asymptotically. When the difference in densities lies in a Sobolev ball, we prove minimax optimality of our MMD test with a specific kernel depending on the smoothness parameter of the Sobolev ball. In practice, this parameter is unknown and, hence, the optimal MMD test with this particular kernel cannot be used. To overcome this issue, we construct an aggregated test, called MMDAgg, which is adaptive to the smoothness parameter. The test power is maximised over the collection of kernels used, without requiring held-out data for kernel selection (which results in a loss of test power), or arbitrary kernel choices such as the median heuristic. We prove that MMDAgg still controls the level non-asymptotically, and achieves the minimax rate over Sobolev balls, up to an iterated logarithmic term. Our guarantees are not restricted to a specific type of kernel, but hold for any product of one-dimensional translation invariant characteristic kernels. We provide a user-friendly parameter-free implementation of MMDAgg using an adaptive collection of bandwidths. We demonstrate that MMDAgg significantly outperforms alternative state-of-the-art MMD-based two-sample tests on synthetic data satisfying the Sobolev smoothness assumption, and that, on real-world image data, MMDAgg closely matches the power of tests leveraging the use of models such as neural networks.
Divide-and-Conquer Fusion
http://jmlr.org/papers/v24/21-1274.html
http://jmlr.org/papers/volume24/21-1274/21-1274.pdf
2023Ryan S.Y. Chan, Murray Pollock, Adam M. Johansen, Gareth O. Roberts
Combining several (sample approximations of) distributions, which we term sub-posteriors, into a single distribution proportional to their product, is a common challenge. Occurring, for instance, in distributed 'big data' problems, or when working under multi-party privacy constraints. Many existing approaches resort to approximating the individual sub-posteriors for practical necessity, then find either an analytical approximation or sample approximation of the resulting (product-pooled) posterior. The quality of the posterior approximation for these approaches is poor when the sub-posteriors fall out-with a narrow range of distributional form, such as being approximately Gaussian. Recently, a Fusion approach has been proposed which finds an exact Monte Carlo approximation of the posterior, circumventing the drawbacks of approximate approaches. Unfortunately, existing Fusion approaches have a number of computational limitations, particularly when unifying a large number of sub-posteriors. In this paper, we generalise the theory underpinning existing Fusion approaches, and embed the resulting methodology within a recursive divide-and-conquer sequential Monte Carlo paradigm. This ultimately leads to a competitive Fusion approach, which is robust to increasing numbers of sub-posteriors.
PAC-learning for Strategic Classification
http://jmlr.org/papers/v24/21-1250.html
http://jmlr.org/papers/volume24/21-1250/21-1250.pdf
2023Ravi Sundaram, Anil Vullikanti, Haifeng Xu, Fan Yao
The study of strategic or adversarial manipulation of testing data to fool a classifier has attracted much recent attention. Most previous works have focused on two extreme situations where any testing data point either is completely adversarial or always equally prefers the positive label. In this paper, we generalize both of these through a unified framework by considering strategic agents with heterogenous preferences, and introduce the notion of strategic VC-dimension (SVC) to capture the PAC-learnability in our general strategic setup. SVC provably generalizes the recent concept of adversarial VC-dimension (AVC) introduced by Cullina et al. (2018). We instantiate our framework for the fundamental strategic linear classification problem. We fully characterize: (1) the statistical learnability of linear classifiers by pinning down its SVC; (2) its computational tractability by pinning down the complexity of the empirical risk minimization problem. Interestingly, the SVC of linear classifiers is always upper bounded by its standard VC-dimension. This characterization also strictly generalizes the AVC bound for linear classifiers in (Cullina et al., 2018). Finally, we briefly investigate the power of randomization in our strategic classification setup. We show that randomization may strictly increase the accuracy in general, but will not help in the special case of adversarial classification with zero-manipulation-cost.
Insights into Ordinal Embedding Algorithms: A Systematic Evaluation
http://jmlr.org/papers/v24/21-1170.html
http://jmlr.org/papers/volume24/21-1170/21-1170.pdf
2023Leena Chennuru Vankadara, Michael Lohaus, Siavash Haghiri, Faiz Ul Wahab, Ulrike von Luxburg
The objective of ordinal embedding is to find a Euclidean representation of a set of abstract items, using only answers to triplet comparisons of the form “Is item $i$ closer to item $j$ or item $k$?”. In recent years, numerous algorithms have been proposed to solve this problem. However, there does not exist a fair and thorough assessment of these embedding methods and therefore several key questions remain unanswered: Which algorithms perform better when the embedding dimension is constrained or few triplet comparisons are available? Which ones scale better with increasing sample size or dimension? In our paper, we address these questions and provide an extensive and systematic empirical evaluation of existing algorithms as well as a new neural network approach. We find that simple, relatively unknown, non-convex methods consistently outperform all other algorithms across a broad range of tasks including more recent and elaborate methods based on neural networks or landmark approaches. This finding can be explained by the insight that many of the non-convex optimization approaches do not suffer from local optima. Our comprehensive assessment is enabled by our unified library of popular embedding algorithms that leverages GPU resources and allows for fast and accurate embeddings of millions of data points.
Clustering with Tangles: Algorithmic Framework and Theoretical Guarantees
http://jmlr.org/papers/v24/21-1160.html
http://jmlr.org/papers/volume24/21-1160/21-1160.pdf
2023Solveig Klepper, Christian Elbracht, Diego Fioravanti, Jakob Kneip, Luca Rendsburg, Maximilian Teegen, Ulrike von Luxburg
Originally, tangles were invented as an abstract tool in mathematical graph theory to prove the famous graph minor theorem. In this paper, we showcase the practical potential of tangles in machine learning applications. Given a collection of cuts of any dataset, tangles aggregate these cuts to point in the direction of a dense structure. As a result, a cluster is softly characterized by a set of consistent pointers. This highly flexible approach can solve clustering problems in various setups, ranging from questionnaires over community detection in graphs to clustering points in metric spaces. The output of our proposed framework is hierarchical and induces the notion of a soft dendrogram, which can help explore the cluster structure of a dataset. The computational complexity of aggregating the cuts is linear in the number of data points. Thus the bottleneck of the tangle approach is to generate the cuts, for which simple and fast algorithms form a sufficient basis. In our paper we construct the algorithmic framework for clustering with tangles, prove theoretical guarantees in various settings, and provide extensive simulations and use cases. Python code is available on github.
Random Feature Neural Networks Learn Black-Scholes Type PDEs Without Curse of Dimensionality
http://jmlr.org/papers/v24/21-0987.html
http://jmlr.org/papers/volume24/21-0987/21-0987.pdf
2023Lukas Gonon
This article investigates the use of random feature neural networks for learning Kolmogorov partial (integro-)differential equations associated to Black-Scholes and more general exponential Lévy models. Random feature neural networks are single-hidden-layer feedforward neural networks in which the hidden weights are randomly generated and only the output weights are trainable. This makes training particularly simple, but (a priori) reduces expressivity. Interestingly, this is not the case for certain Black-Scholes type PDEs, as we show here. We derive bounds for the prediction error of random neural networks for learning sufficiently non-degenerate Black-Scholes type models. A full error analysis - bounding the approximation, generalization and optimization error of the algorithm - is provided and it is shown that the derived bounds do not suffer from the curse of dimensionality. We also investigate an application of these results to basket options and validate the bounds numerically. These results prove that neural networks are able to learn solutions to suitable Black-Scholes type PDEs without the curse of dimensionality. In addition, this provides an example of a relevant learning problem in which random feature neural networks are provably efficient.
The Proximal ID Algorithm
http://jmlr.org/papers/v24/21-0950.html
http://jmlr.org/papers/volume24/21-0950/21-0950.pdf
2023Ilya Shpitser, Zach Wood-Doughty, Eric J. Tchetgen Tchetgen
Unobserved confounding is a fundamental obstacle to establishing valid causal conclusions from observational data. Two complementary types of approaches have been developed to address this obstacle: obtaining identification using fortuitous external aids, such as instrumental variables or proxies, or by means of the ID algorithm, using Markov restrictions on the full data distribution encoded in graphical causal models. In this paper we aim to develop a synthesis of the former and latter approaches to identification in causal inference to yield the most general identification algorithm in multivariate systems currently known -- the proximal ID algorithm. In addition to being able to obtain nonparametric identification in all cases where the ID algorithm succeeds, our approach allows us to systematically exploit proxies to adjust for the presence of unobserved confounders that would have otherwise prevented identification. In addition, we outline a class of estimation strategies for causal parameters identified by our method in an important special case. We illustrate our approach by simulation studies and a data application.
Quantifying Network Similarity using Graph Cumulants
http://jmlr.org/papers/v24/21-082.html
http://jmlr.org/papers/volume24/21-082/21-082.pdf
2023Gecia Bravo-Hermsdorff, Lee M. Gunderson, Pierre-André Maugis, Carey E. Priebe
How might one test the hypothesis that networks were sampled from the same distribution? Here, we compare two statistical tests that use subgraph counts to address this question. The first uses the empirical subgraph densities themselves as estimates of those of the underlying distribution. The second test uses a new approach that converts these subgraph densities into estimates of the graph cumulants of the distribution (without any increase in computational complexity). We demonstrate --- via theory, simulation, and application to real data --- the superior statistical power of using graph cumulants. In summary, when analyzing data using subgraph/motif densities, we suggest using the corresponding graph cumulants instead.
Learning an Explicit Hyper-parameter Prediction Function Conditioned on Tasks
http://jmlr.org/papers/v24/21-0742.html
http://jmlr.org/papers/volume24/21-0742/21-0742.pdf
2023Jun Shu, Deyu Meng, Zongben Xu
Meta learning has attracted much attention recently in machine learning community. Contrary to conventional machine learning aiming to learn inherent prediction rules to predict labels for new query data, meta learning aims to learn the learning methodology for machine learning from observed tasks, so as to generalize to new query tasks by leveraging the meta-learned learning methodology. In this study, we achieve such learning methodology by learning an explicit hyper-parameter prediction function shared by all training tasks, and we call this learning process as Simulating Learning Methodology (SLeM). Specifically, this function is represented as a parameterized function called meta-learner, mapping from a training/test task to its suitable hyper-parameter setting, extracted from a pre-specified function set called meta learning machine. Such setting guarantees that the meta-learned learning methodology is able to flexibly fit diverse query tasks, instead of only obtaining fixed hyper-parameters by many current meta learning methods, with less adaptability to query task's variations. Such understanding of meta learning also makes it easily succeed from traditional learning theory for analyzing its generalization bounds with general losses/tasks/models. The theory naturally leads to some feasible controlling strategies for ameliorating the quality of the extracted meta-learner, verified to be able to finely ameliorate its generalization capability in some typical meta learning applications, including few-shot regression, few-shot classification and domain generalization. The source code of our method is released at https://github.com/xjtushujun/SLeM-Theory.
On the Theoretical Equivalence of Several Trade-Off Curves Assessing Statistical Proximity
http://jmlr.org/papers/v24/21-0607.html
http://jmlr.org/papers/volume24/21-0607/21-0607.pdf
2023Rodrigue Siry, Ryan Webster, Loic Simon, Julien Rabin
The recent advent of powerful generative models has triggered the renewed development of quantitative measures to assess the proximity of two probability distributions. As the scalar Frechet Inception Distance remains popular, several methods have explored computing entire curves, which reveal the trade-off between the fidelity and variability of the first distribution with respect to the second one. Several of such variants have been proposed independently and while intuitively similar, their relationship has not yet been made explicit. In an effort to make the emerging picture of generative evaluation more clear, we propose a unification of four curves known respectively as: the Precision-Recall (PR) curve, the Lorenz curve, the Receiver Operating Characteristic (ROC) curve and a special case of Rényi divergence frontiers. In addition, we discuss possible links between PR / Lorenz curves with the derivation of domain adaptation bounds.
Metrizing Weak Convergence with Maximum Mean Discrepancies
http://jmlr.org/papers/v24/21-0599.html
http://jmlr.org/papers/volume24/21-0599/21-0599.pdf
2023Carl-Johann Simon-Gabriel, Alessandro Barp, Bernhard Schölkopf, Lester Mackey
This paper characterizes the maximum mean discrepancies (MMD) that metrize the weak convergence of probability measures for a wide class of kernels. More precisely, we prove that, on a locally compact, non-compact, Hausdorff space, the MMD of a bounded continuous Borel measurable kernel $k$, whose RKHS-functions vanish at infinity (i.e., $H_k \subset C_0$), metrizes the weak convergence of probability measures if and only if $k$ is continuous and integrally strictly positive definite ($\int$s.p.d.) over all signed, finite, regular Borel measures. We also correct a prior result of Simon-Gabriel and Schölkopf (JMLR 2018, Thm. 12) by showing that there exist both bounded continuous $\int$s.p.d. kernels that do not metrize weak convergence and bounded continuous non-$\int$s.p.d. kernels that do metrize it.
Quasi-Equivalence between Width and Depth of Neural Networks
http://jmlr.org/papers/v24/21-0579.html
http://jmlr.org/papers/volume24/21-0579/21-0579.pdf
2023Fenglei Fan, Rongjie Lai, Ge Wang
While classic studies proved that wide networks allow universal approximation, recent research and successes of deep learning demonstrate the power of deep networks. Based on a symmetric consideration, we investigate if the design of artificial neural networks should have a directional preference, and what the mechanism of interaction is between the width and depth of a network. Inspired by the De Morgan law, we address this fundamental question by establishing a quasi-equivalence between the width and depth of ReLU networks. We formulate two transforms for mapping an arbitrary ReLU network to a wide ReLU network and a deep ReLU network respectively, so that the essentially same capability of the original network can be implemented. Based on our findings, a deep network has a wide equivalent, and vice versa, subject to an arbitrarily small error.
Naive regression requires weaker assumptions than factor models to adjust for multiple cause confounding
http://jmlr.org/papers/v24/21-0515.html
http://jmlr.org/papers/volume24/21-0515/21-0515.pdf
2023Justin Grimmer, Dean Knox, Brandon Stewart
The empirical practice of using factor models to adjust for shared, unobserved confounders, $\boldsymbol{Z}$, in observational settings with multiple treatments, $\boldsymbol{A}$, is widespread in fields including genetics, networks, medicine, and politics. Wang and Blei (2019, WB) generalize these procedures to develop the “deconfounder,” a causal inference method using factor models of $\boldsymbol{A}$ to estimate “substitute confounders,” $\widehat{\boldsymbol{Z}}$, then estimating treatment effects---regressing the outcome, $\boldsymbol{Y}$, on part of $\boldsymbol{A}$ while adjusting for $\widehat{\boldsymbol{Z}}$. WB claim the deconfounder is unbiased when (among other assumptions) there are no single-cause confounders and $\widehat{\boldsymbol{Z}}$ is “pinpointed.” We clarify pinpointing requires each confounder to affect infinitely many treatments. We prove that when the conditions hold for the deconfounder to be asymptotically unbiased, a naive semiparametric regression of $\boldsymbol{Y}$ on $\boldsymbol{A}$ which ignores confounding is also asymptotically unbiased. We provide bias formulas for finite numbers of treatments and show that different deconfounders exhibit different kinds of bias. We replicate every deconfounder analysis with available data and find that neither the naive regression nor the deconfounder consistently outperform the other. In practice, the deconfounder produces implausible estimates in WB's case study of movie earnings: estimates suggest comic author Stan Lee's cameo appearances causally contributed $15.5 billion, most of Marvel movie revenue. We conclude neither approach is a viable substitute for careful research design in real-world applications.
Factor Graph Neural Networks
http://jmlr.org/papers/v24/21-0434.html
http://jmlr.org/papers/volume24/21-0434/21-0434.pdf
2023Zhen Zhang, Mohammed Haroon Dupty, Fan Wu, Javen Qinfeng Shi, Wee Sun Lee
In recent years, we have witnessed a surge of Graph Neural Networks (GNNs), most of which can learn powerful representations in an end-to-end fashion with great success in many real-world applications. They have resemblance to Probabilistic Graphical Models (PGMs), but break free from some limitations of PGMs. By aiming to provide expressive methods for representation learning instead of computing marginals or most likely configurations, GNNs provide flexibility in the choice of information flowing rules while maintaining good performance. Despite their success and inspirations, they lack efficient ways to represent and learn higher-order relations among variables/nodes. More expressive higher-order GNNs which operate on k-tuples of nodes need increased computational resources in order to process higher-order tensors. We propose Factor Graph Neural Networks (FGNNs) to effectively capture higher-order relations for inference and learning. To do so, we first derive an efficient approximate Sum-Product loopy belief propagation inference algorithm for discrete higher-order PGMs. We then neuralize the novel message passing scheme into a Factor Graph Neural Network (FGNN) module by allowing richer representations of the message update rules; this facilitates both efficient inference and powerful end-to-end learning. We further show that with a suitable choice of message aggregation operators, our FGNN is also able to represent Max-Product belief propagation, providing a single family of architecture that can represent both Max and Sum-Product loopy belief propagation. Our extensive experimental evaluation on synthetic as well as real datasets demonstrates the potential of the proposed model.
Dropout Training is Distributionally Robust Optimal
http://jmlr.org/papers/v24/21-0377.html
http://jmlr.org/papers/volume24/21-0377/21-0377.pdf
2023José Blanchet, Yang Kang, José Luis Montiel Olea, Viet Anh Nguyen, Xuhui Zhang
This paper shows that dropout training in generalized linear models is the minimax solution of a two-player, zero-sum game where an adversarial nature corrupts a statistician's covariates using a multiplicative nonparametric errors-in-variables model. In this game, nature's least favorable distribution is dropout noise, where nature independently deletes entries of the covariate vector with some fixed probability $\delta$. This result implies that dropout training indeed provides out-of-sample expected loss guarantees for distributions that arise from multiplicative perturbations of in-sample data. The paper makes a concrete recommendation on how to select the tuning parameter $\delta$. The paper also provides a novel, parallelizable, unbiased multi-level Monte Carlo algorithm to speed-up the implementation of dropout training. Our algorithm has a much smaller computational cost compared to the naive implementation of dropout, provided the number of data points is much smaller than the dimension of the covariate vector.
Variational Inference for Deblending Crowded Starfields
http://jmlr.org/papers/v24/21-0169.html
http://jmlr.org/papers/volume24/21-0169/21-0169.pdf
2023Runjing Liu, Jon D. McAuliffe, Jeffrey Regier, The LSST Dark Energy Science Collaboration
In images collected by astronomical surveys, stars and galaxies often overlap visually. Deblending is the task of distinguishing and characterizing individual light sources in survey images. We propose StarNet, a Bayesian method to deblend sources in astronomical images of crowded star fields. StarNet leverages recent advances in variational inference, including amortized variational distributions and an optimization objective targeting an expectation of the forward KL divergence. In our experiments with SDSS images of the M2 globular cluster, StarNet is substantially more accurate than two competing methods: Probabilistic Cataloging (PCAT), a method that uses MCMC for inference, and DAOPHOT, a software pipeline employed by SDSS for deblending. In addition, the amortized approach to inference gives StarNet the scaling characteristics necessary to perform Bayesian inference on modern astronomical surveys.
F2A2: Flexible Fully-decentralized Approximate Actor-critic for Cooperative Multi-agent Reinforcement Learning
http://jmlr.org/papers/v24/20-700.html
http://jmlr.org/papers/volume24/20-700/20-700.pdf
2023Wenhao Li, Bo Jin, Xiangfeng Wang, Junchi Yan, Hongyuan Zha
Traditional centralized multi-agent reinforcement learning (MARL) algorithms are sometimes unpractical in complicated applications due to non-interactivity between agents, the curse of dimensionality, and computation complexity. Hence, several decentralized MARL algorithms are motivated. However, existing decentralized methods only handle the fully cooperative setting where massive information needs to be transmitted in training. The block coordinate gradient descent scheme they used for successive independent actor and critic steps can simplify the calculation, but it causes serious bias. This paper proposes a flexible fully decentralized actor-critic MARL framework, which can combine most of the actor-critic methods and handle large-scale general cooperative multi-agent settings. A primal-dual hybrid gradient descent type algorithm framework is designed to learn individual agents separately for decentralization. From the perspective of each agent, policy improvement and value evaluation are jointly optimized, which can stabilize multi-agent policy learning. Furthermore, the proposed framework can achieve scalability and stability for the large-scale environment. This framework also reduces information transmission by the parameter sharing mechanism and novel modeling-other-agents methods based on theory-of-mind and online supervised learning. Sufficient experiments in cooperative Multi-agent Particle Environment and StarCraft II show that the proposed decentralized MARL instantiation algorithms perform competitively against conventional centralized and decentralized methods.
Comprehensive Algorithm Portfolio Evaluation using Item Response Theory
http://jmlr.org/papers/v24/20-1318.html
http://jmlr.org/papers/volume24/20-1318/20-1318.pdf
2023Sevvandi Kandanaarachchi, Kate Smith-Miles
Item Response Theory (IRT) has been proposed within the field of Educational Psychometrics to assess student ability as well as test question difficulty and discrimination power. More recently, IRT has been applied to evaluate machine learning algorithm performance on a single classification dataset, where the student is now an algorithm, and the test question is an observation to be classified by the algorithm. In this paper we present a modified IRT-based framework for evaluating a portfolio of algorithms across a repository of datasets, while simultaneously eliciting a richer suite of characteristics - such as algorithm consistency and anomalousness - that describe important aspects of algorithm performance. These characteristics arise from a novel inversion and reinterpretation of the traditional IRT model without requiring additional dataset feature computations. We test this framework on algorithm portfolios for a wide range of applications, demonstrating the broad applicability of this method as an insightful algorithm evaluation tool. Furthermore, the explainable nature of IRT parameters yield an increased understanding of algorithm portfolios.
Evaluating Instrument Validity using the Principle of Independent Mechanisms
http://jmlr.org/papers/v24/20-1287.html
http://jmlr.org/papers/volume24/20-1287/20-1287.pdf
2023Patrick F. Burauel
The validity of instrumental variables to estimate causal effects is typically justified narratively and often remains controversial. Critical assumptions are difficult to evaluate since they involve unobserved variables. Building on Janzing and Schoelkopf's (2018) method to quantify a degree of confounding in multivariate linear models, we develop a test that evaluates instrument validity without relying on Balke and Pearl's (1997) inequality constraints. Instead, our approach is based on the Principle of Independent Mechanisms, which states that causal models have a modular structure. Monte Carlo studies show a high accuracy of the procedure. We apply our method to two empirical studies: first, we can corroborate the narrative justification given by Card (1995) for the validity of college proximity as an instrument for educational attainment in his work on the financial returns to education. Second, we cannot reject the validity of past savings rates as an instrument for economic development to estimate its causal effect on democracy (Acemoglu et al, 2008).
Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity
http://jmlr.org/papers/v24/20-1131.html
http://jmlr.org/papers/volume24/20-1131/20-1131.pdf
2023Kaiqing Zhang, Sham M. Kakade, Tamer Basar, Lin F. Yang
Model-based reinforcement learning (RL), which finds an optimal policy after establishing an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and the planning phases, and avoids the non-stationarity problem when all agents are improving their policies simultaneously. Though intuitive and widely-used, the sample complexity of model-based MARL algorithms has not been fully investigated. In this paper, we aim to ad- dress the fundamental question about its sample complexity. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model. We show that model-based MARL achieves a sample complexity of Oe(|S||A||B|(1 − γ)−3ε−2) for finding the Nash equilibrium (NE) value up to some ε error, and the ε-NE policies with a smooth planning oracle, where γ is the discount factor, and S,A,B denote the state space, and the action spaces for the two agents. We further show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge, by establishing a matching lower bound. This is in contrast to the usual reward- aware setting, where the sample complexity lower bound is Ωe(|S|(|A| + |B|)(1 − γ)−3ε−2), and this model-based approach is near-optimal with only a gap on the |A|, |B| dependence. Our results not only illustrate the sample-efficiency of this basic model-based MARL approach, but also elaborate on the fundamental tradeoff between its power (easily handling the reward-agnostic case) and limitation (less adaptive and suboptimal in |A|, |B|), which particularly arises in the multi-agent context.
Posterior Consistency for Bayesian Relevance Vector Machines
http://jmlr.org/papers/v24/20-1012.html
http://jmlr.org/papers/volume24/20-1012/20-1012.pdf
2023Xiao Fang, Malay Ghosh
Statistical modeling and inference problems with sample sizes substantially smaller than the number of available covariates are challenging. Chakraborty et al. (2012) did a full hierarchical Bayesian analysis of nonlinear regression in such situations using relevance vector machines based on reproducing kernel Hilbert space (RKHS). But they did not provide any theoretical properties associated with their procedure. The present paper revisits their problem, introduces a new class of global-local priors different from theirs, and provides results on posterior consistency as well as on posterior contraction rates.
From Classification Accuracy to Proper Scoring Rules: Elicitability of Probabilistic Top List Predictions
http://jmlr.org/papers/v24/23-0106.html
http://jmlr.org/papers/volume24/23-0106/23-0106.pdf
2023Johannes Resin
In the face of uncertainty, the need for probabilistic assessments has long been recognized in the literature on forecasting. In classification, however, comparative evaluation of classifiers often focuses on predictions specifying a single class through the use of simple accuracy measures, which disregard any probabilistic uncertainty quantification. I propose probabilistic top lists as a novel type of prediction in classification, which bridges the gap between single-class predictions and predictive distributions. The probabilistic top list functional is elicitable through the use of strictly consistent evaluation metrics. The proposed evaluation metrics are based on symmetric proper scoring rules and admit comparison of various types of predictions ranging from single-class point predictions to fully specified predictive distributions. The Brier score yields a metric that is particularly well suited for this kind of comparison.
Beyond the Golden Ratio for Variational Inequality Algorithms
http://jmlr.org/papers/v24/22-1488.html
http://jmlr.org/papers/volume24/22-1488/22-1488.pdf
2023Ahmet Alacaoglu, Axel Böhm, Yura Malitsky
We improve the understanding of the golden ratio algorithm, which solves monotone variational inequalities (VI) and convex-concave min-max problems via the distinctive feature of adapting the step sizes to the local Lipschitz constants. Adaptive step sizes not only eliminate the need to pick hyperparameters, but they also remove the necessity of global Lipschitz continuity and can increase from one iteration to the next. We first establish the equivalence of this algorithm with popular VI methods such as reflected gradient, Popov or optimistic gradient descent-ascent (OGDA) in the unconstrained case with constant step sizes. We then move on to the constrained setting and introduce a new analysis that allows to use larger step sizes, to complete the bridge between the golden ratio algorithm and the existing algorithms in the literature. Doing so, we actually eliminate the link between the golden ratio {$\frac{1+\sqrt{5}}{2}$} and the algorithm. Moreover, we improve the adaptive version of the algorithm, first by removing the maximum step size hyperparameter (an artifact from the analysis), and secondly, by adjusting it to nonmonotone problems with weak Minty solutions, with superior empirical performance.
Incremental Learning in Diagonal Linear Networks
http://jmlr.org/papers/v24/22-1395.html
http://jmlr.org/papers/volume24/22-1395/22-1395.pdf
2023Raphaël Berthier
Diagonal linear networks (DLNs) are a toy simplification of artificial neural networks; they consist in a quadratic reparametrization of linear regression inducing a sparse implicit regularization. In this paper, we describe the trajectory of the gradient flow of DLNs in the limit of small initialization. We show that incremental learning is effectively performed in the limit: coordinates are successively activated, while the iterate is the minimizer of the loss constrained to have support on the active coordinates only. This shows that the sparse implicit regularization of DLNs decreases with time. This work is restricted to the underparametrized regime with anti-correlated features for technical reasons.
Small Transformers Compute Universal Metric Embeddings
http://jmlr.org/papers/v24/22-1246.html
http://jmlr.org/papers/volume24/22-1246/22-1246.pdf
2023Anastasis Kratsios, Valentin Debarnot, Ivan Dokmanić
We study representations of data from an arbitrary metric space $\mathcal{X}$ in the space of univariate Gaussian mixtures equipped with a transport metric (Delon and Desolneux 2020). We prove embedding guarantees for feature maps implemented by small neural networks called probabilistic transformers. Our guarantees are of memorization type: we prove that a probabilistic transformer of depth about $n\log(n)$ and width about $n^2$ can bi-H\"older embed any $n$-point dataset from $\mathcal{X}$ with low metric distortion, thus avoiding the curse of dimensionality. We further derive probabilistic bi-Lipschitz guarantees, which trade off the amount of distortion and the probability that a randomly chosen pair of points embeds with that distortion. If the geometry of $\mathcal{X}$ is sufficiently regular, we obtain stronger bi-Lipschitz guarantees for all points. As applications, we derive neural embedding guarantees for datasets from Riemannian manifolds, metric trees, and certain types of combinatorial graphs. When instead embedding into multivariate Gaussian mixtures, we show that probabilistic transformers compute bi-Hölder embeddings with arbitrarily small distortion. Our results show that any finite metric dataset, from vertices on a graph to functions a function space, can be faithfully represented in a single representation space, and that the representation can be implemented by a simple transformer architecture. Thus one may only need a modular set of machine learning tools compatible with this one representation space, many of which already exist, for downstream supervised and unsupervised learning from a great variety of data types.
DART: Distance Assisted Recursive Testing
http://jmlr.org/papers/v24/22-1131.html
http://jmlr.org/papers/volume24/22-1131/22-1131.pdf
2023Xuechan Li, Anthony D. Sung, Jichun Xie
Multiple testing is a commonly used tool in modern data science. Sometimes, the hypotheses are embedded in a space; the distances between the hypotheses reflect their co-null/co- alternative patterns. Properly incorporating the distance information in testing will boost testing power. Hence, we developed a new multiple testing framework named Distance Assisted Recursive Testing (DART). DART features in joint artificial intelligence (AI) and statistics modeling. It has two stages. The first stage uses AI models to construct an aggregation tree that reflects the distance information. The second stage uses statistical models to embed the testing on the tree and control the false discovery rate. Theoretical analysis and numerical experiments demonstrated that DART generates valid, robust, and powerful results. We applied DART to a clinical trial in the allogeneic stem cell transplantation study to identify the gut microbiota whose abundance was impacted by post-transplant care.
Inference on the Change Point under a High Dimensional Covariance Shift
http://jmlr.org/papers/v24/22-1122.html
http://jmlr.org/papers/volume24/22-1122/22-1122.pdf
2023Abhishek Kaul, Hongjin Zhang, Konstantinos Tsampourakis, George Michailidis
We consider the problem of constructing asymptotically valid confidence intervals for the change point in a high-dimensional covariance shift setting. A novel estimator for the change point parameter is developed, and its asymptotic distribution under high dimensional scaling obtained. We establish that the proposed estimator exhibits a sharp $O_p(\psi^{-2})$ rate of convergence, wherein $\psi$ represents the jump size between model parameters before and after the change point. Further, the form of the asymptotic distributions under both a vanishing and a non-vanishing regime of the jump size are characterized. In the former case, it corresponds to the argmax of an asymmetric Brownian motion, while in the latter case to the argmax of an asymmetric random walk. We then obtain the relationship between these distributions, which allows construction of regime (vanishing vs non-vanishing) adaptive confidence intervals. Easy to implement algorithms for the proposed methodology are developed and their performance illustrated on synthetic and real data sets.
Bilevel Optimization with a Lower-level Contraction: Optimal Sample Complexity without Warm-Start
http://jmlr.org/papers/v24/22-1043.html
http://jmlr.org/papers/volume24/22-1043/22-1043.pdf
2023Riccardo Grazzi, Massimiliano Pontil, Saverio Salzo
We analyse a general class of bilevel problems, in which the upper-level problem consists in the minimization of a smooth objective function and the lower-level problem is to find the fixed point of a smooth contraction map. This type of problems include instances of meta-learning, equilibrium models, hyperparameter optimization and data poisoning adversarial attacks. Several recent works have proposed algorithms which warm-start the lower-level problem, i.e. they use the previous lower-level approximate solution as a staring point for the lower-level solver. This warm-start procedure allows one to improve the sample complexity in both the stochastic and deterministic settings, achieving in some cases the order-wise optimal sample complexity. However, there are situations, e.g., meta learning and equilibrium models, in which the warm-start procedure is not well-suited or ineffective. In this work we show that without warm-start, it is still possible to achieve order-wise (near) optimal sample complexity. In particular, we propose a simple method which uses (stochastic) fixed point iterations at the lower-level and projected inexact gradient descent at the upper-level, that reaches an $\epsilon$-stationary point using $O(\epsilon^{-2})$ and $\tilde{O}(\epsilon^{-1})$ samples for the stochastic and the deterministic setting, respectively. Finally, compared to methods using warm-start, our approach yields a simpler analysis that does not need to study the coupled interactions between the upper-level and lower-level iterates.
A Parameter-Free Conditional Gradient Method for Composite Minimization under Hölder Condition
http://jmlr.org/papers/v24/22-0983.html
http://jmlr.org/papers/volume24/22-0983/22-0983.pdf
2023Masaru Ito, Zhaosong Lu, Chuan He
In this paper we consider a composite optimization problem that minimizes the sum of a weakly smooth function and a convex function with either a bounded domain or a uniformly convex structure. In particular, we first present a parameter-dependent conditional gradient method for this problem, whose step sizes require prior knowledge of the parameters associated with the Hölder continuity of the gradient of the weakly smooth function, and establish its rate of convergence. Given that these parameters could be unknown or known but possibly conservative, such a method may suffer from implementation issue or slow convergence. We therefore propose a parameter-free conditional gradient method whose step size is determined by using a constructive local quadratic upper approximation and an adaptive line search scheme, without using any problem parameter. We show that this method achieves the same rate of convergence as the parameter-dependent conditional gradient method. Preliminary experiments are also conducted and illustrate the superior performance of the parameter-free conditional gradient method over the methods with some other step size rules.
Robust Methods for High-Dimensional Linear Learning
http://jmlr.org/papers/v24/22-0964.html
http://jmlr.org/papers/volume24/22-0964/22-0964.pdf
2023Ibrahim Merad, Stéphane Gaïffas
We propose statistically robust and computationally efficient linear learning methods in the high-dimensional batch setting, where the number of features $d$ may exceed the sample size $n$. We employ, in a generic learning setting, two algorithms depending on whether the considered loss function is gradient-Lipschitz or not. Then, we instantiate our framework on several applications including vanilla sparse, group-sparse and low-rank matrix recovery. This leads, for each application, to efficient and robust learning algorithms, that reach near-optimal estimation rates under heavy-tailed distributions and the presence of outliers. For vanilla $s$-sparsity, we are able to reach the $s\log (d)/n$ rate under heavy-tails and $\eta$-corruption, at a computational cost comparable to that of non-robust analogs. We provide an efficient implementation of our algorithms in an open-source Python library called linlearn, by means of which we carry out numerical experiments which confirm our theoretical findings together with a comparison to other recent approaches proposed in the literature.
A Framework and Benchmark for Deep Batch Active Learning for Regression
http://jmlr.org/papers/v24/22-0937.html
http://jmlr.org/papers/volume24/22-0937/22-0937.pdf
2023David Holzmüller, Viktor Zaverkin, Johannes Kästner, Ingo Steinwart
The acquisition of labels for supervised learning can be expensive. To improve the sample efficiency of neural network regression, we study active learning methods that adaptively select batches of unlabeled data for labeling. We present a framework for constructing such methods out of (network-dependent) base kernels, kernel transformations, and selection methods. Our framework encompasses many existing Bayesian methods based on Gaussian process approximations of neural networks as well as non-Bayesian methods. Additionally, we propose to replace the commonly used last-layer features with sketched finite-width neural tangent kernels and to combine them with a novel clustering method. To evaluate different methods, we introduce an open-source benchmark consisting of 15 large tabular regression data sets. Our proposed method outperforms the state-of-the-art on our benchmark, scales to large data sets, and works out-of-the-box without adjusting the network architecture or training code. We provide open-source code that includes efficient implementations of all kernels, kernel transformations, and selection methods, and can be used for reproducing our results.
Preconditioned Gradient Descent for Overparameterized Nonconvex Burer--Monteiro Factorization with Global Optimality Certification
http://jmlr.org/papers/v24/22-0882.html
http://jmlr.org/papers/volume24/22-0882/22-0882.pdf
2023Gavin Zhang, Salar Fattahi, Richard Y. Zhang
We consider using gradient descent to minimize the nonconvex function $f(X)=\phi(XX^{T})$ over an $n\times r$ factor matrix $X$, in which $\phi$ is an underlying smooth convex cost function defined over $n\times n$ matrices. While only a second-order stationary point $X$ can be provably found in reasonable time, if $X$ is additionally rank deficient, then its rank deficiency certifies it as being globally optimal. This way of certifying global optimality necessarily requires the search rank $r$ of the current iterate $X$ to be overparameterized with respect to the rank $r^{\star}$ of the global minimizer $X^{\star}$. Unfortunately, overparameterization significantly slows down the convergence of gradient descent, from a linear rate with $r=r^{\star}$ to a sublinear rate when $r>r^{\star}$, even when $\phi$ is strongly convex. In this paper, we propose an inexpensive preconditioner that restores the convergence rate of gradient descent back to linear in the overparameterized case, while also making it agnostic to possible ill-conditioning in the global minimizer $X^{\star}$.
Flexible Model Aggregation for Quantile Regression
http://jmlr.org/papers/v24/22-0799.html
http://jmlr.org/papers/volume24/22-0799/22-0799.pdf
2023Rasool Fakoor, Taesup Kim, Jonas Mueller, Alexander J. Smola, Ryan J. Tibshirani
Quantile regression is a fundamental problem in statistical learning motivated by a need to quantify uncertainty in predictions, or to model a diverse population without being overly reductive. For instance, epidemiological forecasts, cost estimates, and revenue predictions all benefit from being able to quantify the range of possible values accurately. As such, many models have been developed for this problem over many years of research in statistics, machine learning, and related fields. Rather than proposing yet another (new) algorithm for quantile regression we adopt a meta viewpoint: we investigate methods for aggregating any number of conditional quantile models, in order to improve accuracy and robustness. We consider weighted ensembles where weights may vary over not only individual models, but also over quantile levels, and feature values. All of the models we consider in this paper can be fit using modern deep learning toolkits, and hence are widely accessible (from an implementation point of view) and scalable. To improve the accuracy of the predicted quantiles (or equivalently, prediction intervals), we develop tools for ensuring that quantiles remain monotonically ordered, and apply conformal calibration methods. These can be used without any modification of the original library of base models. We also review some basic theory surrounding quantile aggregation and related scoring rules, and contribute a few new results to this literature (for example, the fact that post sorting or post isotonic regression can only improve the weighted interval score). Finally, we provide an extensive suite of empirical comparisons across 34 data sets from two different benchmark repositories.
q-Learning in Continuous Time
http://jmlr.org/papers/v24/22-0755.html
http://jmlr.org/papers/volume24/22-0755/22-0755.pdf
2023Yanwei Jia, Xun Yu Zhou
We study the continuous-time counterpart of Q-learning for reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation introduced by Wang et al. (2020). As the conventional (big) Q-function collapses in continuous time, we consider its first-order approximation and coin the term “(little) q-function". This function is related to the instantaneous advantage rate function as well as the Hamiltonian. We develop a “q-learning" theory around the q-function that is independent of time discretization. Given a stochastic policy, we jointly characterize the associated q-function and value function by martingale conditions of certain stochastic processes, in both on-policy and off-policy settings. We then apply the theory to devise different actor--critic algorithms for solving underlying RL problems, depending on whether or not the density function of the Gibbs measure generated from the q-function can be computed explicitly. One of our algorithms interprets the well-known Q-learning algorithm SARSA, and another recovers a policy gradient (PG) based continuous-time algorithm proposed in Jia and Zhou (2022b). Finally, we conduct simulation experiments to compare the performance of our algorithms with those of PG-based algorithms in Jia and Zhou (2022b) and time-discretized conventional Q-learning algorithms.
Multivariate Soft Rank via Entropy-Regularized Optimal Transport: Sample Efficiency and Generative Modeling
http://jmlr.org/papers/v24/22-0616.html
http://jmlr.org/papers/volume24/22-0616/22-0616.pdf
2023Shoaib Bin Masud, Matthew Werenski, James M. Murphy, Shuchin Aeron
The framework of optimal transport has been leveraged to extend the notion of rank to the multivariate setting as corresponding to an optimal transport map, while preserving desirable properties of the resulting goodness-of-fit (GoF) statistics. In particular, the rank energy (RE) and rank maximum mean discrepancy (RMMD) are distribution-free under the null, exhibit high power in statistical testing, and are robust to outliers. In this paper, we point to and alleviate some of the shortcomings of these GoF statistics that are of practical significance, namely high computational cost, curse of dimensionality in statistical sample complexity, and lack of differentiability with respect to the data. We show that all these issues are addressed by defining multivariate rank as an entropic transport map derived from the entropic regularization of the optimal transport problem, which we refer to as the soft rank. We consequently propose two new statistics, the soft rank energy (sRE) and soft rank maximum mean discrepancy (sRMMD). Given n sample data points, we provide non-asymptotic convergence rates for the sample estimate of the entropic transport map to its population version that are essentially of the order n^(-1/2) when the source measure is subgaussian and the target measure has compact support. This result is novel compared to existing results which achieve a rate of n^(-1) but crucially rely on both measures having compact support. In contrast, the corresponding convergence rate of estimating an optimal transport map, and hence the rank map, is exponential in the data dimension. We leverage these fast convergence rates to show that the sample estimates of sRE and sRMMD converge rapidly to their population versions. Combined with the computational efficiency of methods in solving the entropy-regularized optimal transport problem, these results enable efficient rank-based GoF statistical computation, even in high dimensions. Furthermore, the sample estimates of sRE and sRMMD are differentiable with respect to the data and amenable to popular machine learning frameworks that rely on gradient methods. We leverage these properties towards showcasing their utility for generative modeling on two important problems: image generation and generating valid knockoffs for controlled feature selection.
Infinite-dimensional optimization and Bayesian nonparametric learning of stochastic differential equations
http://jmlr.org/papers/v24/22-0582.html
http://jmlr.org/papers/volume24/22-0582/22-0582.pdf
2023Arnab Ganguly, Riten Mitra, Jinpu Zhou
The paper has two major themes. The first part of the paper establishes certain general results for infinite-dimensional optimization problems on Hilbert spaces. These results cover the classical representer theorem and many of its variants as special cases and offer a wider scope of applications. The second part of the paper then develops a systematic approach for learning the drift function of a stochastic differential equation by integrating the results of the first part with Bayesian hierarchical framework. Importantly, our Bayesian approach incorporates low-cost sparse learning through proper use of shrinkage priors while allowing proper quantification of uncertainty through posterior distributions. Several examples at the end illustrate the accuracy of our learning scheme.
Asynchronous Iterations in Optimization: New Sequence Results and Sharper Algorithmic Guarantees
http://jmlr.org/papers/v24/22-0555.html
http://jmlr.org/papers/volume24/22-0555/22-0555.pdf
2023Hamid Reza Feyzmahdavian, Mikael Johansson
We introduce novel convergence results for asynchronous iterations that appear in the analysis of parallel and distributed optimization algorithms. The results are simple to apply and give explicit estimates for how the degree of asynchrony impacts the convergence rates of the iterates. Our results shorten, streamline and strengthen existing convergence proofs for several asynchronous optimization methods and allow us to establish convergence guarantees for popular algorithms that were thus far lacking a complete theoretical understanding. Specifically, we use our results to derive better iteration complexity bounds for proximal incremental aggregated gradient methods, to obtain tighter guarantees depending on the average rather than maximum delay for the asynchronous stochastic gradient descent method, to provide less conservative analyses of the speedup conditions for asynchronous block-coordinate implementations of Krasnoselskii–Mann iterations, and to quantify the convergence rates for totally asynchronous iterations under various assumptions on communication delays and update rates.
Restarted Nonconvex Accelerated Gradient Descent: No More Polylogarithmic Factor in the in the O(epsilon^(-7/4)) Complexity
http://jmlr.org/papers/v24/22-0522.html
http://jmlr.org/papers/volume24/22-0522/22-0522.pdf
2023Huan Li, Zhouchen Lin
This paper studies accelerated gradient methods for nonconvex optimization with Lipschitz continuous gradient and Hessian. We propose two simple accelerated gradient methods, restarted accelerated gradient descent (AGD) and restarted heavy ball (HB) method, and establish that our methods achieve an $\epsilon$-approximate first-order stationary point within $O(\epsilon^{-7/4})$ number of gradient evaluations by elementary proofs. Theoretically, our complexity does not hide any polylogarithmic factors, and thus it improves over the best known one by the $O(\log\frac{1}{\epsilon})$ factor. Our algorithms are simple in the sense that they only consist of Nesterov's classical AGD or Polyak's HB iterations, as well as a restart mechanism. They do not invoke negative curvature exploitation or minimization of regularized surrogate functions as the subroutines. In contrast with existing analysis, our elementary proofs use less advanced techniques and do not invoke the analysis of strongly convex AGD or HB.
Integrating Random Effects in Deep Neural Networks
http://jmlr.org/papers/v24/22-0501.html
http://jmlr.org/papers/volume24/22-0501/22-0501.pdf
2023Giora Simchoni, Saharon Rosset
Modern approaches to supervised learning like deep neural networks (DNNs) typically implicitly assume that observed responses are statistically independent. In contrast, correlated data are prevalent in real-life large-scale applications, with typical sources of correlation including spatial, temporal and clustering structures. These correlations are either ignored by DNNs, or ad-hoc solutions are developed for specific use cases. We propose to use the mixed models framework to handle correlated data in DNNs. By treating the effects underlying the correlation structure as random effects, mixed models are able to avoid overfitted parameter estimates and ultimately yield better predictive performance. The key to combining mixed models and DNNs is using the Gaussian negative log-likelihood (NLL) as a natural loss function that is minimized with DNN machinery including stochastic gradient descent (SGD). Since NLL does not decompose like standard DNN loss functions, the use of SGD with NLL presents some theoretical and implementation challenges, which we address. Our approach which we call LMMNN is demonstrated to improve performance over natural competitors in various correlation scenarios on diverse simulated and real datasets. Our focus is on a regression setting and tabular datasets, but we also show some results for classification. Our code is available at https://github.com/gsimchoni/lmmnn.
Adaptive Data Depth via Multi-Armed Bandits
http://jmlr.org/papers/v24/22-0497.html
http://jmlr.org/papers/volume24/22-0497/22-0497.pdf
2023Tavor Baharav, Tze Leung Lai
Data depth, introduced by Tukey (1975), is an important tool in data science, robust statistics, and computational geometry. One chief barrier to its broader practical utility is that many common measures of depth are computationally intensive, requiring on the order of $n^d$ operations to exactly compute the depth of a single point within a data set of $n$ points in $d$-dimensional space. Often however, we are not directly interested in the absolute depths of the points, but rather in their relative ordering. For example, we may want to find the most central point in a data set (a generalized median), or to identify and remove all outliers (points on the fringe of the data set with low depth). With this observation, we develop a novel instance-adaptive algorithm for adaptive data depth computation by reducing the problem of exactly computing $n$ depths to an $n$-armed stochastic multi-armed bandit problem which we can efficiently solve. We focus our exposition on simplicial depth, developed by Liu (1990), which has emerged as a promising notion of depth due to its interpretability and asymptotic properties. We provide general data-dependent theoretical guarantees for our proposed algorithms, which readily extend to many other common measures of data depth including majority depth, Oja depth, and likelihood depth. When specialized to the case where the gaps in the data follow a power law distribution with parameter $\alpha<2$, we reduce the complexity of identifying the deepest point in the data set (the simplicial median) from $O(n^d)$ to $\tilde{O}(n^{d-(d-1)\alpha/2})$, where $\tilde{O}$ suppresses a logarithmic factor. We corroborate our theoretical results with numerical experiments on synthetic data, showing the practical utility of our proposed methods.
Adapting and Evaluating Influence-Estimation Methods for Gradient-Boosted Decision Trees
http://jmlr.org/papers/v24/22-0449.html
http://jmlr.org/papers/volume24/22-0449/22-0449.pdf
2023Jonathan Brophy, Zayd Hammoudeh, Daniel Lowd
Influence estimation analyzes how changes to the training data can lead to different model predictions; this analysis can help us better understand these predictions, the models making those predictions, and the data sets they are trained on. However, most influence-estimation techniques are designed for deep learning models with continuous parameters. Gradient-boosted decision trees (GBDTs) are a powerful and widely-used class of models; however, these models are black boxes with opaque decision-making processes. In the pursuit of better understanding GBDT predictions and generally improving these models, we adapt recent and popular influence-estimation methods designed for deep learning models to GBDTs. Specifically, we adapt representer-point methods and TracIn, denoting our new methods TREX and BoostIn, respectively; source code is available at https://github.com/jjbrophy47/treeinfluence. We compare these methods to LeafInfluence and other baselines using 5 different evaluation measures on 22 real-world data sets with 4 popular GBDT implementations. These experiments give us a comprehensive overview of how different approaches to influence estimation work in GBDT models. We find BoostIn is an efficient influence-estimation method for GBDTs that performs equally well or better than existing work while being four orders of magnitude faster. Our evaluation also suggests the gold-standard approach of leave-one-out (LOO) retraining consistently identifies the single-most influential training example but performs poorly at finding the most influential set of training examples for a given target prediction.
Consistent Model-based Clustering using the Quasi-Bernoulli Stick-breaking Process
http://jmlr.org/papers/v24/22-0436.html
http://jmlr.org/papers/volume24/22-0436/22-0436.pdf
2023Cheng Zeng, Jeffrey W Miller, Leo L Duan
In mixture modeling and clustering applications, the number of components and clusters is often not known. A stick-breaking mixture model, such as the Dirichlet process mixture model, is an appealing construction that assumes infinitely many components, while shrinking the weights of most of the unused components to near zero. However, it is well-known that this shrinkage is inadequate: even when the component distribution is correctly specified, spurious weights appear and give an inconsistent estimate of the number of clusters. In this article, we propose a simple solution: when breaking each mixture weight stick into two pieces, the length of the second piece is multiplied by a quasi-Bernoulli random variable, taking value one or a small constant close to zero. This effectively creates a soft truncation and further shrinks the unused weights. Asymptotically, we show that as long as this small constant diminishes to zero at a rate faster than $o(1/n^2)$, with $n$ the sample size and given data from a finite mixture model, the posterior distribution will converge to the true number of clusters. In comparison, we rigorously explore Dirichlet process mixture models using a concentration parameter that is either constant or rapidly diminishes to zero---both of which lead to inconsistency for the number of clusters. Our proposed model is easy to implement, requiring only a small modification of a standard Gibbs sampler for mixture models. In simulations and a data application of clustering brain networks, our proposed method recovers the ground-truth number of clusters, and leads to a small number of clusters.
Selective inference for k-means clustering
http://jmlr.org/papers/v24/22-0371.html
http://jmlr.org/papers/volume24/22-0371/22-0371.pdf
2023Yiqun T. Chen, Daniela M. Witten
We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. In recent work, Gao et al. (2022) considered a related problem in the context of hierarchical clustering. Unfortunately, their solution is highly-tailored to the context of hierarchical clustering, and thus cannot be applied in the setting of k-means clustering. In this paper, we propose a p-value that conditions on all of the intermediate clustering assignments in the k-means algorithm. We show that the p-value controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering in finite samples, and can be efficiently computed. We apply our proposal on hand-written digits data and on single-cell RNA-sequencing data.
Generalization error bounds for multiclass sparse linear classifiers
http://jmlr.org/papers/v24/22-0367.html
http://jmlr.org/papers/volume24/22-0367/22-0367.pdf
2023Tomer Levy, Felix Abramovich
We consider high-dimensional multiclass classification by sparse multinomial logistic regression. Unlike binary classification, in the multiclass setup one can think about an entire spectrum of possible notions of sparsity associated with different structural assumptions on the regression coefficients matrix. We propose a computationally feasible feature selection procedure based on penalized maximum likelihood with convex penalties capturing a specific type of sparsity at hand. In particular, we consider global row-wise sparsity, double row-wise sparsity, and low-rank sparsity, and show that with the properly chosen tuning parameters the derived plug-in classifiers attain the minimax generalization error bounds (in terms of misclassification excess risk) within the corresponding classes of multiclass sparse linear classifiers. The developed approach is general and can be adapted to other types of sparsity as well.
MALib: A Parallel Framework for Population-based Multi-agent Reinforcement Learning
http://jmlr.org/papers/v24/22-0169.html
http://jmlr.org/papers/volume24/22-0169/22-0169.pdf
2023Ming Zhou, Ziyu Wan, Hanjing Wang, Muning Wen, Runzhe Wu, Ying Wen, Yaodong Yang, Yong Yu, Jun Wang, Weinan Zhang
Population-based multi-agent reinforcement learning (PB-MARL) encompasses a range of methods that merge dynamic population selection with multi-agent reinforcement learning algorithms (MARL). While PB-MARL has demonstrated notable achievements in complex multi-agent tasks, its sequential execution is plagued by low computational efficiency due to the diversity in computing patterns and policy combinations. We propose a solution involving a stateless central task dispatcher and stateful workers to handle PB-MARL's subroutines, thereby capitalizing on parallelism across various components for efficient problem-solving. In line with this approach, we introduce MALib, a parallel framework that incorporates a task control model, independent data servers, and an abstraction of MARL training paradigms. The framework has undergone extensive testing and is available under the MIT license (https://github.com/sjtu-marl/malib)
Controlling Wasserstein Distances by Kernel Norms with Application to Compressive Statistical Learning
http://jmlr.org/papers/v24/21-1516.html
http://jmlr.org/papers/volume24/21-1516/21-1516.pdf
2023Titouan Vayer, Rémi Gribonval
Comparing probability distributions is at the crux of many machine learning algorithms. Maximum Mean Discrepancies (MMD) and Wasserstein distances are two classes of distances between probability distributions that have attracted abundant attention in past years. This paper establishes some conditions under which the Wasserstein distance can be controlled by MMD norms. Our work is motivated by the compressive statistical learning (CSL) theory, a general framework for resource-efficient large scale learning in which the training data is summarized in a single vector (called sketch) that captures the information relevant to the considered learning task. Inspired by existing results in CSL, we introduce the Hölder Lower Restricted Isometric Property and show that this property comes with interesting guarantees for compressive statistical learning. Based on the relations between the MMD and the Wasserstein distances, we provide guarantees for compressive statistical learning by introducing and studying the concept of Wasserstein regularity of the learning task, that is when some task-specific metric between probability distributions can be bounded by a Wasserstein distance.
Fast Objective & Duality Gap Convergence for Non-Convex Strongly-Concave Min-Max Problems with PL Condition
http://jmlr.org/papers/v24/21-1471.html
http://jmlr.org/papers/volume24/21-1471/21-1471.pdf
2023Zhishuai Guo, Yan Yan, Zhuoning Yuan, Tianbao Yang
This paper focuses on stochastic methods for solving smooth non-convex strongly-concave min-max problems, which have received increasing attention due to their potential applications in deep learning (e.g., deep AUC maximization, distributionally robust optimization). However, most of the existing algorithms are slow in practice, and their analysis revolves around the convergence to a nearly stationary point. We consider leveraging the Polyak-Lojasiewicz (PL) condition to design faster stochastic algorithms with stronger convergence guarantee. Although PL condition has been utilized for designing many stochastic minimization algorithms, their applications for non-convex min-max optimization remain rare. In this paper, we propose and analyze a generic framework of proximal stage-based method with many well-known stochastic updates embeddable. Fast convergence is established in terms of both the primal objective gap and the duality gap. Compared with existing studies, (i) our analysis is based on a novel Lyapunov function consisting of the primal objective gap and the duality gap of a regularized function, and (ii) the results are more comprehensive with improved rates that have better dependence on the condition number under different assumptions. We also conduct deep and non-deep learning experiments to verify the effectiveness of our methods.
Stochastic Optimization under Distributional Drift
http://jmlr.org/papers/v24/21-1410.html
http://jmlr.org/papers/volume24/21-1410/21-1410.pdf
2023Joshua Cutler, Dmitriy Drusvyatskiy, Zaid Harchaoui
We consider the problem of minimizing a convex function that is evolving according to unknown and possibly stochastic dynamics, which may depend jointly on time and on the decision variable itself. Such problems abound in the machine learning and signal processing literature, under the names of concept drift, stochastic tracking, and performative prediction. We provide novel non-asymptotic convergence guarantees for stochastic algorithms with iterate averaging, focusing on bounds valid both in expectation and with high probability. The efficiency estimates we obtain clearly decouple the contributions of optimization error, gradient noise, and time drift. Notably, we identify a low drift-to-noise regime in which the tracking efficiency of the proximal stochastic gradient method benefits significantly from a step decay schedule. Numerical experiments illustrate our results.
Off-Policy Actor-Critic with Emphatic Weightings
http://jmlr.org/papers/v24/21-1350.html
http://jmlr.org/papers/volume24/21-1350/21-1350.pdf
2023Eric Graves, Ehsan Imani, Raksha Kumaraswamy, Martha White
A variety of theoretically-sound policy gradient algorithms exist for the on-policy setting due to the policy gradient theorem, which provides a simplified form for the gradient. The off-policy setting, however, has been less clear due to the existence of multiple objectives and the lack of an explicit off-policy policy gradient theorem. In this work, we unify these objectives into one off-policy objective, and provide a policy gradient theorem for this unified objective. The derivation involves emphatic weightings and interest functions. We show multiple strategies to approximate the gradients, in an algorithm called Actor Critic with Emphatic weightings (ACE). We prove in a counterexample that previous (semi-gradient) off-policy actor-critic methods—particularly Off-Policy Actor-Critic (OffPAC) and Deterministic Policy Gradient (DPG)—converge to the wrong solution whereas ACE finds the optimal solution. We also highlight why these semi-gradient approaches can still perform well in practice, suggesting strategies for variance reduction in ACE. We empirically study several variants of ACE on two classic control environments and an image-based environment designed to illustrate the tradeoffs made by each gradient approximation. We find that by approximating the emphatic weightings directly, ACE performs as well as or better than OffPAC in all settings tested.
Memory-Based Optimization Methods for Model-Agnostic Meta-Learning and Personalized Federated Learning
http://jmlr.org/papers/v24/21-1301.html
http://jmlr.org/papers/volume24/21-1301/21-1301.pdf
2023Bokun Wang, Zhuoning Yuan, Yiming Ying, Tianbao Yang
In recent years, model-agnostic meta-learning (MAML) has become a popular research area. However, the stochastic optimization of MAML is still underdeveloped. Existing MAML algorithms rely on the “episode” idea by sampling a few tasks and data points to update the meta-model at each iteration. Nonetheless, these algorithms either fail to guarantee convergence with a constant mini-batch size or require processing a large number of tasks at every iteration, which is unsuitable for continual learning or cross-device federated learning where only a small number of tasks are available per iteration or per round. To address these issues, this paper proposes memory-based stochastic algorithms for MAML that converge with vanishing error. The proposed algorithms require sampling a constant number of tasks and data samples per iteration, making them suitable for the continual learning scenario. Moreover, we introduce a communication-efficient memory-based MAML algorithm for personalized federated learning in cross-device (with client sampling) and cross-silo (without client sampling) settings. Our theoretical analysis improves the optimization theory for MAML, and our empirical results corroborate our theoretical findings.
Escaping The Curse of Dimensionality in Bayesian Model-Based Clustering
http://jmlr.org/papers/v24/21-1276.html
http://jmlr.org/papers/volume24/21-1276/21-1276.pdf
2023Noirrit Kiran Chandra, Antonio Canale, David B. Dunson
Bayesian mixture models are widely used for clustering of high-dimensional data with appropriate uncertainty quantification. However, as the dimension of the observations increases, posterior inference often tends to favor too many or too few clusters. This article explains this behavior by studying the random partition posterior in a non-standard setting with a fixed sample size and increasing data dimensionality. We provide conditions under which the finite sample posterior tends to either assign every observation to a different cluster or all observations to the same cluster as the dimension grows. Interestingly, the conditions do not depend on the choice of clustering prior, as long as all possible partitions of observations into clusters have positive prior probabilities, and hold irrespective of the true data-generating model. We then propose a class of latent mixtures for Bayesian clustering (Lamb) on a set of low-dimensional latent variables inducing a partition on the observed data. The model is amenable to scalable posterior inference and we show that it can avoid the pitfalls of high-dimensionality under mild assumptions. The proposed approach is shown to have good performance in simulation studies and an application to inferring cell types based on scRNAseq.
Large sample spectral analysis of graph-based multi-manifold clustering
http://jmlr.org/papers/v24/21-1254.html
http://jmlr.org/papers/volume24/21-1254/21-1254.pdf
2023Nicolas Garcia Trillos, Pengfei He, Chenghui Li
In this work we study statistical properties of graph-based algorithms for multi-manifold clustering (MMC). In MMC the goal is to retrieve the multi-manifold structure underlying a given Euclidean data set when this one is assumed to be obtained by sampling a distribution on a union of manifolds $\M = \M_1 \cup\dots \cup \M_N$ that may intersect with each other and that may have different dimensions. We investigate sufficient conditions that similarity graphs on data sets must satisfy in order for their corresponding graph Laplacians to capture the right geometric information to solve the MMC problem. Precisely, we provide high probability error bounds for the spectral approximation of a tensorized Laplacian on $\M$ with a suitable graph Laplacian built from the observations; the recovered tensorized Laplacian contains all geometric information of all the individual underlying manifolds. We provide an example of a family of similarity graphs, which we call annular proximity graphs with angle constraints, satisfying these sufficient conditions. We contrast our family of graphs with other constructions in the literature based on the alignment of tangent planes. Extensive numerical experiments expand the insights that our theory provides on the MMC problem.
On Tilted Losses in Machine Learning: Theory and Applications
http://jmlr.org/papers/v24/21-1095.html
http://jmlr.org/papers/volume24/21-1095/21-1095.pdf
2023Tian Li, Ahmad Beirami, Maziar Sanjabi, Virginia Smith
Exponential tilting is a technique commonly used in fields such as statistics, probability, information theory, and optimization to create parametric distribution shifts. Despite its prevalence in related fields, tilting has not seen widespread use in machine learning. In this work, we aim to bridge this gap by exploring the use of tilting in risk minimization. We study a simple extension to ERM---tilted empirical risk minimization (TERM)---which uses exponential tilting to flexibly tune the impact of individual losses. The resulting framework has several useful properties: We show that TERM can increase or decrease the influence of outliers, respectively, to enable fairness or robustness; has variance-reduction properties that can benefit generalization; and can be viewed as a smooth approximation to the tail probability of losses. Our work makes connections between TERM and related objectives, such as Value-at-Risk, Conditional Value-at-Risk, and distributionally robust optimization (DRO). We develop batch and stochastic first-order optimization methods for solving TERM, provide convergence guarantees for the solvers, and show that the framework can be efficiently solved relative to common alternatives. Finally, we demonstrate that TERM can be used for a multitude of applications in machine learning, such as enforcing fairness between subgroups, mitigating the effect of outliers, and handling class imbalance. Despite the straightforward modification TERM makes to traditional ERM objectives, we find that the framework can consistently outperform ERM and deliver competitive performance with state-of-the-art, problem-specific approaches.
Optimal Convergence Rates for Distributed Nystroem Approximation
http://jmlr.org/papers/v24/21-1049.html
http://jmlr.org/papers/volume24/21-1049/21-1049.pdf
2023Jian Li, Yong Liu, Weiping Wang
The distributed kernel ridge regression (DKRR) has shown great potential in processing complicated tasks. However, DKRR only made use of the local samples that failed to capture the global characteristics. Besides, the existing optimal learning guarantees were provided in expectation and only pertain to the attainable case that the target regression lies exactly in the kernel space. In this paper, we propose distributed learning with globally-shared Nystroem centers (DNystroem), which utilizes global information across the local clients. We also study the statistical properties of DNystroem in expectation and in probability, respectively, and obtain several state-of-the-art results with the minimax optimal learning rates. Note that, the optimal convergence rates for DNystroem pertain to the non-attainable case, while the statistical results allow more partitions and require fewer Nystroem centers. Finally, we conduct experiments on several real-world datasets to validate the effectiveness of the proposed algorithm, and the empirical results coincide with our theoretical findings.
Jump Interval-Learning for Individualized Decision Making with Continuous Treatments
http://jmlr.org/papers/v24/21-0843.html
http://jmlr.org/papers/volume24/21-0843/21-0843.pdf
2023Hengrui Cai, Chengchun Shi, Rui Song, Wenbin Lu
An individualized decision rule (IDR) is a decision function that assigns each individual a given treatment based on his/her observed characteristics. Most of the existing works in the literature consider settings with binary or finitely many treatment options. In this paper, we focus on the continuous treatment setting and propose a jump interval-learning to develop an individualized interval-valued decision rule (I2DR) that maximizes the expected outcome. Unlike IDRs that recommend a single treatment, the proposed I2DR yields an interval of treatment options for each individual, making it more flexible to implement in practice. To derive an optimal I2DR, our jump interval-learning method estimates the conditional mean of the outcome given the treatment and the covariates via jump penalized regression, and derives the corresponding optimal I2DR based on the estimated outcome regression function. The regressor is allowed to be either linear for clear interpretation or deep neural network to model complex treatment-covariates interactions. To implement jump interval-learning, we develop a searching algorithm based on dynamic programming that efficiently computes the outcome regression function. Statistical properties of the resulting I2DR are established when the outcome regression function is either a piecewise or continuous function over the treatment space. We further develop a procedure to infer the mean outcome under the (estimated) optimal policy. Extensive simulations and a real data application to a Warfarin study are conducted to demonstrate the empirical validity of the proposed I2DR.
Policy Gradient Methods Find the Nash Equilibrium in N-player General-sum Linear-quadratic Games
http://jmlr.org/papers/v24/21-0842.html
http://jmlr.org/papers/volume24/21-0842/21-0842.pdf
2023Ben Hambly, Renyuan Xu, Huining Yang
We consider a general-sum N-player linear-quadratic game with stochastic dynamics over a finite horizon and prove the global convergence of the natural policy gradient method to the Nash equilibrium. In order to prove convergence of the method we require a certain amount of noise in the system. We give a condition, essentially a lower bound on the covariance of the noise in terms of the model parameters, in order to guarantee convergence. We illustrate our results with numerical experiments to show that even in situations where the policy gradient method may not converge in the deterministic setting, the addition of noise leads to convergence.
Asymptotics of Network Embeddings Learned via Subsampling
http://jmlr.org/papers/v24/21-0841.html
http://jmlr.org/papers/volume24/21-0841/21-0841.pdf
2023Andrew Davison, Morgane Austern
Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering and link prediction. A frequent approach begins by learning an Euclidean embedding of the network, to which algorithms developed for vector-valued data are applied. For large networks, embeddings are learned using stochastic gradient methods where the sub-sampling scheme can be freely chosen. Despite the strong empirical performance of such methods, they are not well understood theoretically. Our work encapsulates representation methods using a subsampling approach, such as node2vec, into a single unifying framework. We prove, under the assumption that the graph is exchangeable, that the distribution of the learned embedding vectors asymptotically decouples. Moreover, we characterize the asymptotic distribution and provided rates of convergence, in terms of the latent parameters, which includes the choice of loss function and the embedding dimension. This provides a theoretical foundation to understand what the embedding vectors represent and how well these methods perform on downstream tasks. Notably, we observe that typically used loss functions may lead to shortcomings, such as a lack of Fisher consistency.
Implicit Bias of Gradient Descent for Mean Squared Error Regression with Two-Layer Wide Neural Networks
http://jmlr.org/papers/v24/21-0832.html
http://jmlr.org/papers/volume24/21-0832/21-0832.pdf
2023Hui Jin, Guido Montufar
We investigate gradient descent training of wide neural networks and the corresponding implicit bias in function space. For univariate regression, we show that the solution of training a width-$n$ shallow ReLU network is within $n^{- 1/2}$ of the function which fits the training data and whose difference from the initial function has the smallest 2-norm of the second derivative weighted by a curvature penalty that depends on the probability distribution that is used to initialize the network parameters. We compute the curvature penalty function explicitly for various common initialization procedures. For instance, asymmetric initialization with a uniform distribution yields a constant curvature penalty, and thence the solution function is the natural cubic spline interpolation of the training data. For stochastic gradient descent we obtain the same implicit bias result. We obtain a similar result for different activation functions. For multivariate regression we show an analogous result, whereby the second derivative is replaced by the Radon transform of a fractional Laplacian. For initialization schemes that yield a constant penalty function, the solutions are polyharmonic splines. Moreover, we show that the training trajectories are captured by trajectories of smoothing splines with decreasing regularization strength.
Dimension Reduction in Contextual Online Learning via Nonparametric Variable Selection
http://jmlr.org/papers/v24/21-0818.html
http://jmlr.org/papers/volume24/21-0818/21-0818.pdf
2023Wenhao Li, Ningyuan Chen, L. Jeff Hong
We consider a contextual online learning (multi-armed bandit) problem with high-dimensional covariate $x$ and decision $y$. The reward function to learn, $f(x,y)$, does not have a particular parametric form. The literature has shown that the optimal regret is $\tilde{O}(T^{(d_x\!+\!d_y\!+\!1)/(d_x\!+\!d_y\!+\!2)})$, where $d_x$ and $d_y$ are the dimensions of $x$ and $y$, and thus it suffers from the curse of dimensionality. In many applications, only a small subset of variables in the covariate affect the value of $f$, which is referred to as sparsity in statistics. To take advantage of the sparsity structure of the covariate, we propose a variable selection algorithm called BV-LASSO, which incorporates novel ideas such as binning and voting to apply LASSO to nonparametric settings. Using it as a subroutine, we can achieve the regret $\tilde{O}(T^{(d_x^*\!+\!d_y\!+\!1)/(d_x^*\!+\!d_y\!+\!2)})$, where $d_x^*$ is the effective covariate dimension. The regret matches the optimal regret when the covariate is $d^*_x$-dimensional and thus cannot be improved. Our algorithm may serve as a general recipe to achieve dimension reduction via variable selection in nonparametric settings.
Sparse GCA and Thresholded Gradient Descent
http://jmlr.org/papers/v24/21-0745.html
http://jmlr.org/papers/volume24/21-0745/21-0745.pdf
2023Sheng Gao, Zongming Ma
Generalized correlation analysis (GCA) is concerned with uncovering linear relationships across multiple data sets. It generalizes canonical correlation analysis that is designed for two data sets. We study sparse GCA when there are potentially multiple leading generalized correlation tuples in data that are of interest and the loading matrix has a small number of nonzero rows. It includes sparse CCA and sparse PCA of correlation matrices as special cases. We first formulate sparse GCA as a generalized eigenvalue problem at both population and sample levels via a careful choice of normalization constraints. Based on a Lagrangian form of the sample optimization problem, we propose a thresholded gradient descent algorithm for estimating GCA loading vectors and matrices in high dimensions. We derive tight estimation error bounds for estimators generated by the algorithm with proper initialization. We also demonstrate the prowess of the algorithm on a number of synthetic data sets.
MARS: A Second-Order Reduction Algorithm for High-Dimensional Sparse Precision Matrices Estimation
http://jmlr.org/papers/v24/21-0699.html
http://jmlr.org/papers/volume24/21-0699/21-0699.pdf
2023Qian Li, Binyan Jiang, Defeng Sun
Estimation of the precision matrix (or inverse covariance matrix) is of great importance in statistical data analysis and machine learning. However, as the number of parameters scales quadratically with the dimension $p$, the computation becomes very challenging when $p$ is large. In this paper, we propose an adaptive sieving reduction algorithm to generate a solution path for the estimation of precision matrices under the $\ell_1$ penalized D-trace loss, with each subproblem being solved by a second-order algorithm. In each iteration of our algorithm, we are able to greatly reduce the number of variables in the problem based on the Karush-Kuhn-Tucker (KKT) conditions and the sparse structure of the estimated precision matrix in the previous iteration. As a result, our algorithm is capable of handling data sets with very high dimensions that may go beyond the capacity of the existing methods. Moreover, for the sub-problem in each iteration, other than solving the primal problem directly, we develop a semismooth Newton augmented Lagrangian algorithm with global linear convergence rate on the dual problem to improve the efficiency. Theoretical properties of our proposed algorithm have been established. In particular, we show that the convergence rate of our algorithm is asymptotically superlinear. The high efficiency and promising performance of our algorithm are illustrated via extensive simulation studies and real data applications, with comparison to several state-of-the-art solvers.
Exploiting Discovered Regression Discontinuities to Debias Conditioned-on-observable Estimators
http://jmlr.org/papers/v24/21-0670.html
http://jmlr.org/papers/volume24/21-0670/21-0670.pdf
2023Benjamin Jakubowski, Sriram Somanchi, Edward McFowland III, Daniel B. Neill
Regression discontinuity (RD) designs are widely used to estimate causal effects in the absence of a randomized experiment. However, standard approaches to RD analysis face two significant limitations. First, they require a priori knowledge of discontinuities in treatment. Second, they yield doubly-local treatment effect estimates, and fail to provide more general causal effect estimates away from the discontinuity. To address these limitations, we introduce a novel method for automatically detecting RDs at scale, integrating information from multiple discovered discontinuities with an observational estimator, and extrapolating away from discovered, local RDs. We demonstrate the performance of our method on two synthetic datasets, showing improved performance compared to direct use of an observational estimator, direct extrapolation of RD estimates, and existing methods for combining multiple causal effect estimates. Finally, we apply our novel method to estimate spatially heterogeneous treatment effects in the context of a recent economic development problem.
Generalized Linear Models in Non-interactive Local Differential Privacy with Public Data
http://jmlr.org/papers/v24/21-0523.html
http://jmlr.org/papers/volume24/21-0523/21-0523.pdf
2023Di Wang, Lijie Hu, Huanyu Zhang, Marco Gaboardi, Jinhui Xu
In this paper, we study the problem of estimating smooth Generalized Linear Models (GLMs) in the Non-interactive Local Differential Privacy (NLDP) model. Unlike its classical setting, our model allows the server to access additional public but unlabeled data. In the first part of the paper, we focus on GLMs. Specifically, we first consider the case where each data record is i.i.d. sampled from a zero-mean multivariate Gaussian distribution. Motivated by the Stein's lemma, we present an $(\epsilon, \delta)$-NLDP algorithm for GLMs. Moreover, the sample complexity of public and private data for the algorithm to achieve an $\ell_2$-norm estimation error of $\alpha$ (with high probability) is ${O}(p \alpha^{-2})$ and $\tilde{O}(p^3\alpha^{-2}\epsilon^{-2})$ respectively, where $p$ is the dimension of the feature vector. This is a significant improvement over the previously known exponential or quasi-polynomial in $\alpha^{-1}$, or exponential in $p$ sample complexities of GLMs with no public data. Then we consider a more general setting where each data record is i.i.d. sampled from some sub-Gaussian distribution with bounded $\ell_1$-norm. Based on a variant of Stein's lemma, we propose an $(\epsilon, \delta)$-NLDP algorithm for GLMs whose sample complexity of public and private data to achieve an $\ell_\infty$-norm estimation error of $\alpha$ is ${O}(p^2\alpha^{-2})$ and $\tilde{O}(p^2\alpha^{-2}\epsilon^{-2})$ respectively, under some mild assumptions and if $\alpha$ is not too small i.e., $\alpha\geq \Omega(\frac{1}{\sqrt{p}})$). In the second part of the paper, we extend our idea to the problem of estimating non-linear regressions and show similar results as in GLMs for both multivariate Gaussian and sub-Gaussian cases. Finally, we demonstrate the effectiveness of our algorithms through experiments on both synthetic and real-world datasets. To our best knowledge, this is the first paper showing the existence of efficient and effective algorithms for GLMs and non-linear regressions in the NLDP model with unlabeled public data.
A Rigorous Information-Theoretic Definition of Redundancy and Relevancy in Feature Selection Based on (Partial) Information Decomposition
http://jmlr.org/papers/v24/21-0482.html
http://jmlr.org/papers/volume24/21-0482/21-0482.pdf
2023Patricia Wollstadt, Sebastian Schmitt, Michael Wibral
Selecting a minimal feature set that is maximally informative about a target variable is a central task in machine learning and statistics. Information theory provides a powerful framework for formulating feature selection algorithms—yet, a rigorous, information-theoretic definition of feature relevancy, which accounts for feature interactions such as redundant and synergistic contributions, is still missing. We argue that this lack is inherent to classical information theory which does not provide measures to decompose the information a set of variables provides about a target into unique, redundant, and synergistic contributions. Such a decomposition has been introduced only recently by the partial information decomposition (PID) framework. Using PID, we clarify why feature selection is a conceptually difficult problem when approached using information theory and provide a novel definition of feature relevancy and redundancy in PID terms. From this definition, we show that the conditional mutual information (CMI) maximizes relevancy while minimizing redundancy and propose an iterative, CMI-based algorithm for practical feature selection. We demonstrate the power of our CMI-based algorithm in comparison to the unconditional mutual information on benchmark examples and provide corresponding PID estimates to highlight how PID allows to quantify information contribution of features and their interactions in feature-selection problems.
Combinatorial Optimization and Reasoning with Graph Neural Networks
http://jmlr.org/papers/v24/21-0449.html
http://jmlr.org/papers/volume24/21-0449/21-0449.pdf
2023Quentin Cappart, Didier Chételat, Elias B. Khalil, Andrea Lodi, Christopher Morris, Petar Veličković
Combinatorial optimization is a well-established area in operations research and computer science. Until recently, its methods have focused on solving problem instances in isolation, ignoring that they often stem from related data distributions in practice. However, recent years have seen a surge of interest in using machine learning, especially graph neural networks, as a key building block for combinatorial tasks, either directly as solvers or by enhancing exact solvers. The inductive bias of GNNs effectively encodes combinatorial and relational input due to their invariance to permutations and awareness of input sparsity. This paper presents a conceptual review of recent key advancements in this emerging field, aiming at optimization and machine learning researchers.
A First Look into the Carbon Footprint of Federated Learning
http://jmlr.org/papers/v24/21-0445.html
http://jmlr.org/papers/volume24/21-0445/21-0445.pdf
2023Xinchi Qiu, Titouan Parcollet, Javier Fernandez-Marques, Pedro P. B. Gusmao, Yan Gao, Daniel J. Beutel, Taner Topal, Akhil Mathur, Nicholas D. Lane
Despite impressive results, deep learning-based technologies also raise severe privacy and environmental concerns induced by the training procedure often conducted in data centers. In response, alternatives to centralized training such as Federated Learning (FL) have emerged. FL is now starting to be deployed at a global scale by companies that must adhere to new legal demands and policies originating from governments and social groups advocating for privacy protection. However, the potential environmental impact related to FL remains unclear and unexplored. This article offers the first-ever systematic study of the carbon footprint of FL. We propose a rigorous model to quantify the carbon footprint, hence facilitating the investigation of the relationship between FL design and carbon emissions. We also compare the carbon footprint of FL to traditional centralized learning. Our findings show that, depending on the configuration, FL can emit up to two orders of magnitude more carbon than centralized training. However, in certain settings, it can be comparable to centralized learning due to the reduced energy consumption of embedded devices. Finally, we highlight and connect the results to the future challenges and trends in FL to reduce its environmental impact, including algorithms efficiency, hardware capabilities, and stronger industry transparency.
An Eigenmodel for Dynamic Multilayer Networks
http://jmlr.org/papers/v24/21-0270.html
http://jmlr.org/papers/volume24/21-0270/21-0270.pdf
2023Joshua Daniel Loyal, Yuguo Chen
Dynamic multilayer networks frequently represent the structure of multiple co-evolving relations; however, statistical models are not well-developed for this prevalent network type. Here, we propose a new latent space model for dynamic multilayer networks. The key feature of our model is its ability to identify common time-varying structures shared by all layers while also accounting for layer-wise variation and degree heterogeneity. We establish the identifiability of the model's parameters and develop a structured mean-field variational inference approach to estimate the model's posterior, which scales to networks previously intractable to dynamic latent space models. We demonstrate the estimation procedure's accuracy and scalability on simulated networks. We apply the model to two real-world problems: discerning regional conflicts in a data set of international relations and quantifying infectious disease spread throughout a school based on the student's daily contact patterns.
Graph Clustering with Graph Neural Networks
http://jmlr.org/papers/v24/20-998.html
http://jmlr.org/papers/volume24/20-998/20-998.pdf
2023Anton Tsitsulin, John Palowitch, Bryan Perozzi, Emmanuel Müller
Graph Neural Networks (GNNs) have achieved state-of-the-art results on many graph analysis tasks such as node classification and link prediction. However, important unsupervised problems on graphs, such as graph clustering, have proved more resistant to advances in GNNs. Graph clustering has the same overall goal as node pooling in GNNs—does this mean that GNN pooling methods do a good job at clustering graphs? Surprisingly, the answer is no—current GNN pooling methods often fail to recover the cluster structure in cases where simple baselines, such as k-means applied on learned representations, work well. We investigate further by carefully designing a set of experiments to study different signal-to-noise scenarios both in graph structure and attribute data. To address these methods' poor performance in clustering, we introduce Deep Modularity Networks (DMoN), an unsupervised pooling method inspired by the modularity measure of clustering quality, and show how it tackles recovery of the challenging clustering structure of real-world graphs. Similarly, on real-world data, we show that DMoN produces high quality clusters which correlate strongly with ground truth labels, achieving state-of-the-art results with over 40% improvement over other pooling methods across different metrics.
Euler-Lagrange Analysis of Generative Adversarial Networks
http://jmlr.org/papers/v24/20-1390.html
http://jmlr.org/papers/volume24/20-1390/20-1390.pdf
2023Siddarth Asokan, Chandra Sekhar Seelamantula
We consider Generative Adversarial Networks (GANs) and address the underlying functional optimization problem ab initio within a variational setting. Strictly speaking, the optimization of the generator and discriminator functions must be carried out in accordance with the Euler-Lagrange conditions, which become particularly relevant in scenarios where the optimization cost involves regularizers comprising the derivatives of these functions. Considering Wasserstein GANs (WGANs) with a gradient-norm penalty, we show that the optimal discriminator is the solution to a Poisson differential equation. In principle, the optimal discriminator can be obtained in closed form without having to train a neural network. We illustrate this by employing a Fourier-series approximation to solve the Poisson differential equation. Experimental results based on synthesized Gaussian data demonstrate superior convergence behavior of the proposed approach in comparison with the baseline WGAN variants that employ weight-clipping, gradient or Lipschitz penalties on the discriminator on low-dimensional data. We also analyze the truncation error of the Fourier-series approximation and the estimation error of the Fourier coefficients in a high-dimensional setting. We demonstrate applications to real-world images considering latent-space prior matching in Wasserstein autoencoders and present performance comparisons on benchmark datasets such as MNIST, SVHN, CelebA, CIFAR-10, and Ukiyo-E. We demonstrate that the proposed approach achieves comparable reconstruction error and Frechet inception distance with faster convergence and up to two-fold improvement in image sharpness.
Statistical Robustness of Empirical Risks in Machine Learning
http://jmlr.org/papers/v24/20-1039.html
http://jmlr.org/papers/volume24/20-1039/20-1039.pdf
2023Shaoyan Guo, Huifu Xu, Liwei Zhang
This paper studies convergence of empirical risks in reproducing kernel Hilbert spaces (RKHS). A conventional assumption in the existing research is that empirical training data are generated by the unknown true probability distribution but this may not be satisfied in some practical circumstances. Consequently the existing convergence results may not provide a guarantee as to whether the empirical risks are reliable or not when the data are potentially corrupted (generated by a distribution perturbed from the true). In this paper, we fill out the gap from robust statistics perspective (Krätschmer, Schied and Zähle (2012); Krätschmer, Schied and Zähle (2014); Guo and Xu (2020). First, we derive moderate sufficient conditions under which the expected risk changes stably (continuously) against small perturbation of the probability distributions of the underlying random variables and demonstrate how the cost function and kernel affect the stability. Second, we examine the difference between laws of the statistical estimators of the expected optimal loss based on pure data and contaminated data using Prokhorov metric and Kantorovich metric, and derive some asymptotic qualitative and non-asymptotic quantitative statistical robustness results. Third, we identify appropriate metrics under which the statistical estimators are uniformly asymptotically consistent. These results provide theoretical grounding for analysing asymptotic convergence and examining reliability of the statistical estimators in a number of regression models.
HiGrad: Uncertainty Quantification for Online Learning and Stochastic Approximation
http://jmlr.org/papers/v24/18-521.html
http://jmlr.org/papers/volume24/18-521/18-521.pdf
2023Weijie J. Su, Yuancheng Zhu
Stochastic gradient descent (SGD) is an immensely popular approach for online learning in settings where data arrives in a stream or data sizes are very large. However, despite an ever-increasing volume of work on SGD, much less is known about the statistical inferential properties of SGD-based predictions. Taking a fully inferential viewpoint, this paper introduces a novel procedure termed HiGrad to conduct statistical inference for online learning, without incurring additional computational cost compared with SGD. The HiGrad procedure begins by performing SGD updates for a while and then splits the single thread into several threads, and this procedure hierarchically operates in this fashion along each thread. With predictions provided by multiple threads in place, a $t$-based confidence interval is constructed by decorrelating predictions using covariance structures given by a Donsker-style extension of the Ruppert--Polyak averaging scheme, which is a technical contribution of independent interest. Under certain regularity conditions, the HiGrad confidence interval is shown to attain asymptotically exact coverage probability. Finally, the performance of HiGrad is evaluated through extensive simulation studies and a real data example. An R package \texttt{higrad} has been developed to implement the method.
Benign overfitting in ridge regression
http://jmlr.org/papers/v24/22-1398.html
http://jmlr.org/papers/volume24/22-1398/22-1398.pdf
2023Alexander Tsigler, Peter L. Bartlett
In many modern applications of deep learning the neural network has many more parameters than the data points used for its training. Motivated by those practices, a large body of recent theoretical research has been devoted to studying overparameterized models. One of the central phenomena in this regime is the ability of the model to interpolate noisy data, but still have test error lower than the amount of noise in that data. arXiv:1906.11300 characterized for which covariance structure of the data such a phenomenon can happen in linear regression if one considers the interpolating solution with minimum $\ell_2$-norm and the data has independent components: they gave a sharp bound on the variance term and showed that it can be small if and only if the data covariance has high effective rank in a subspace of small co-dimension. We strengthen and complete their results by eliminating the independence assumption and providing sharp bounds for the bias term. Thus, our results apply in a much more general setting than those of arXiv:1906.11300, e.g., kernel regression, and not only characterize how the noise is damped but also which part of the true signal is learned. Moreover, we extend the result to the setting of ridge regression, which allows us to explain another interesting phenomenon: we give general sufficient conditions under which the optimal regularization is negative.
Compute-Efficient Deep Learning: Algorithmic Trends and Opportunities
http://jmlr.org/papers/v24/22-1208.html
http://jmlr.org/papers/volume24/22-1208/22-1208.pdf
2023Brian R. Bartoldson, Bhavya Kailkhura, Davis Blalock
Although deep learning has made great progress in recent years, the exploding economic and environmental costs of training neural networks are becoming unsustainable. To address this problem, there has been a great deal of research on *algorithmically-efficient deep learning*, which seeks to reduce training costs not at the hardware or implementation level, but through changes in the semantics of the training program. In this paper, we present a structured and comprehensive overview of the research in this field. First, we formalize the *algorithmic speedup* problem, then we use fundamental building blocks of algorithmically efficient training to develop a taxonomy. Our taxonomy highlights commonalities of seemingly disparate methods and reveals current research gaps. Next, we present evaluation best practices to enable comprehensive, fair, and reliable comparisons of speedup techniques. To further aid research and applications, we discuss common bottlenecks in the training pipeline (illustrated via experiments) and offer taxonomic mitigation strategies for them. Finally, we highlight some unsolved research challenges and present promising future directions.
Minimal Width for Universal Property of Deep RNN
http://jmlr.org/papers/v24/22-1191.html
http://jmlr.org/papers/volume24/22-1191/22-1191.pdf
2023Chang hoon Song, Geonho Hwang, Jun ho Lee, Myungjoo Kang
A recurrent neural network (RNN) is a widely used deep-learning network for dealing with sequential data. Imitating a dynamical system, an infinite-width RNN can approximate any open dynamical system in a compact domain. In general, deep narrow networks with bounded width and arbitrary depth are more effective than wide shallow networks with arbitrary width and bounded depth in practice; however, the universal approximation theorem for deep narrow structures has yet to be extensively studied. In this study, we prove the universality of deep narrow RNNs and show that the upper bound of the minimum width for universality can be independent of the length of the data. Specifically, we show a deep RNN with ReLU activation can approximate any continuous function or $L^p$ function with the widths $d_x+d_y+3$ and $\max\{d_x+1,d_y\}$, respectively, where the target function maps a finite sequence of vectors in $\mathbb{R}^{d_x}$ to a finite sequence of vectors in $\mathbb{R}^{d_y}$. We also compute the additional width required if the activation function is sigmoid or more. In addition, we prove the universality of other recurrent networks, such as bidirectional RNNs. Bridging a multi-layer perceptron and an RNN, our theory and technique can shed light on further research on deep RNNs.
Maximum likelihood estimation in Gaussian process regression is ill-posed
http://jmlr.org/papers/v24/22-1153.html
http://jmlr.org/papers/volume24/22-1153/22-1153.pdf
2023Toni Karvonen, Chris J. Oates
Gaussian process regression underpins countless academic and industrial applications of machine learning and statistics, with maximum likelihood estimation routinely used to select appropriate parameters for the covariance kernel. However, it remains an open problem to establish the circumstances in which maximum likelihood estimation is well-posed, that is, when the predictions of the regression model are insensitive to small perturbations of the data. This article identifies scenarios where the maximum likelihood estimator fails to be well-posed, in that the predictive distributions are not Lipschitz in the data with respect to the Hellinger distance. These failure cases occur in the noiseless data setting, for any Gaussian process with a stationary covariance function whose lengthscale parameter is estimated using maximum likelihood. Although the failure of maximum likelihood estimation is part of Gaussian process folklore, these rigorous theoretical results appear to be the first of their kind. The implication of these negative results is that well-posedness may need to be assessed post-hoc, on a case-by-case basis, when maximum likelihood estimation is used to train a Gaussian process model.
An Annotated Graph Model with Differential Degree Heterogeneity for Directed Networks
http://jmlr.org/papers/v24/22-1138.html
http://jmlr.org/papers/volume24/22-1138/22-1138.pdf
2023Stefan Stein, Chenlei Leng
Directed networks are conveniently represented as graphs in which ordered edges encode interactions between vertices. Despite their wide availability, there is a shortage of statistical models amenable for inference, specially when contextual information and degree heterogeneity are present. This paper presents an annotated graph model with parameters explicitly accounting for these features. To overcome the curse of dimensionality due to modelling degree heterogeneity, we introduce a sparsity assumption and propose a penalized likelihood approach with $\ell_1$-regularization for parameter estimation. We study the estimation and selection consistency of this approach under a sparse network assumption, and show that inference on the covariate parameter is straightforward, thus bypassing the need for the kind of debiasing commonly employed in $\ell_1$-penalized likelihood estimation. Simulation and data analysis corroborate our theoretical findings.
A Unified Framework for Optimization-Based Graph Coarsening
http://jmlr.org/papers/v24/22-1085.html
http://jmlr.org/papers/volume24/22-1085/22-1085.pdf
2023Manoj Kumar, Anurag Sharma, Sandeep Kumar
Graph coarsening is a widely used dimensionality reduction technique for approaching large-scale graph machine-learning problems. Given a large graph, graph coarsening aims to learn a smaller-tractable graph while preserving the properties of the originally given graph. Graph data consist of node features and graph matrix (e.g., adjacency and Laplacian). The existing graph coarsening methods ignore the node features and rely solely on a graph matrix to simplify graphs. In this paper, we introduce a novel optimization-based framework for graph dimensionality reduction. The proposed framework lies in the unification of graph learning and dimensionality reduction. It takes both the graph matrix and the node features as the input and learns the coarsen graph matrix and the coarsen feature matrix jointly while ensuring desired properties. The proposed optimization formulation is a multi-block non-convex optimization problem, which is solved efficiently by leveraging block majorization-minimization, $\log$ determinant, Dirichlet energy, and regularization frameworks. The proposed algorithms are provably convergent and practically amenable to numerous tasks. It is also established that the learned coarsened graph is $\epsilon\in(0,1)$ similar to the original graph. Extensive experiments elucidate the efficacy of the proposed framework for real-world applications.
Deep linear networks can benignly overfit when shallow ones do
http://jmlr.org/papers/v24/22-1065.html
http://jmlr.org/papers/volume24/22-1065/22-1065.pdf
2023Niladri S. Chatterji, Philip M. Long
We bound the excess risk of interpolating deep linear networks trained using gradient flow. In a setting previously used to establish risk bounds for the minimum $\ell_2$-norm interpolant, we show that randomly initialized deep linear networks can closely approximate or even match known bounds for the minimum $\ell_2$-norm interpolant. Our analysis also reveals that interpolating deep linear models have exactly the same conditional variance as the minimum $\ell_2$-norm solution. Since the noise affects the excess risk only through the conditional variance, this implies that depth does not improve the algorithm's ability to "hide the noise". Our simulations verify that aspects of our bounds reflect typical behavior for simple data distributions. We also find that similar phenomena are seen in simulations with ReLU networks, although the situation there is more nuanced.
SQLFlow: An Extensible Toolkit Integrating DB and AI
http://jmlr.org/papers/v24/22-1047.html
http://jmlr.org/papers/volume24/22-1047/22-1047.pdf
2023Jun Zhou, Ke Zhang, Lin Wang, Hua Wu, Yi Wang, ChaoChao Chen
Integrating AI algorithms into databases is an ongoing effort in both academia and industry. We introduce SQLFlow, a toolkit seamlessly combining data manipulations and AI operations that can be run locally or remotely. SQLFlow extends SQL syntax to support typical AI tasks including model training, inference, interpretation, and mathematical optimization. It is compatible with a variety of database management systems (DBMS) and AI engines, including MySQL, TiDB, MaxCompute, and Hive, as well as TensorFlow, scikit-learn, and XGBoost. Documentations and case studies are available at https://sqlflow.org. The source code and additional details can be found at https://github.com/sql-machine-learning/sqlflow.
Learning Good State and Action Representations for Markov Decision Process via Tensor Decomposition
http://jmlr.org/papers/v24/22-0917.html
http://jmlr.org/papers/volume24/22-0917/22-0917.pdf
2023Chengzhuo Ni, Yaqi Duan, Munther Dahleh, Mengdi Wang, Anru R. Zhang
The transition kernel of a continuous-state-action Markov decision process (MDP) admits a natural tensor structure. This paper proposes a tensor-inspired unsupervised learning method to identify meaningful low-dimensional state and action representations from empirical trajectories. The method exploits the MDP's tensor structure by kernelization, importance sampling and low-Tucker-rank approximation. This method can be further used to cluster states and actions respectively and find the best discrete MDP abstraction. We provide sharp statistical error bounds for tensor concentration and the preservation of diffusion distance after embedding. We further prove that the learned state/action abstractions provide accurate approximations to latent block structures if they exist, enabling function approximation in downstream tasks such as policy evaluation.
Generalization Bounds for Adversarial Contrastive Learning
http://jmlr.org/papers/v24/22-0866.html
http://jmlr.org/papers/volume24/22-0866/22-0866.pdf
2023Xin Zou, Weiwei Liu
Deep networks are well-known to be fragile to adversarial attacks, and adversarial training is one of the most popular methods used to train a robust model. To take advantage of unlabeled data, recent works have applied adversarial training to contrastive learning (Adversarial Contrastive Learning; ACL for short) and obtain promising robust performance. However, the theory of ACL is not well understood. To fill this gap, we leverage the Rademacher omplexity to analyze the generalization performance of ACL, with a particular focus on linear models and multi-layer neural networks under $\ell_p$ attack ($p \ge 1$). Our theory shows that the average adversarial risk of the downstream tasks can be upper bounded by the adversarial unsupervised risk of the upstream task. The experimental results validate our theory.
The Implicit Bias of Benign Overfitting
http://jmlr.org/papers/v24/22-0784.html
http://jmlr.org/papers/volume24/22-0784/22-0784.pdf
2023Ohad Shamir
The phenomenon of benign overfitting, where a predictor perfectly fits noisy training data while attaining near-optimal expected loss, has received much attention in recent years, but still remains not fully understood beyond well-specified linear regression setups. In this paper, we provide several new results on when one can or cannot expect benign overfitting to occur, for both regression and classification tasks. We consider a prototypical and rather generic data model for benign overfitting of linear predictors, where an arbitrary input distribution of some fixed dimension $k$ is concatenated with a high-dimensional distribution. For linear regression which is not necessarily well-specified, we show that the minimum-norm interpolating predictor (that standard training methods converge to) is biased towards an inconsistent solution in general, hence benign overfitting will generally *not* occur. Moreover, we show how this can be extended beyond standard linear regression, by an argument proving how the existence of benign overfitting on some regression problems precludes its existence on other regression problems. We then turn to classification problems, and show that the situation there is much more favorable. Specifically, we prove that the max-margin predictor (to which standard training methods are known to converge in direction) is asymptotically biased towards minimizing a weighted squared hinge loss. This allows us to reduce the question of benign overfitting in classification to the simpler question of whether this loss is a good surrogate for the misclassification error, and use it to show benign overfitting in some new settings.
The Hyperspherical Geometry of Community Detection: Modularity as a Distance
http://jmlr.org/papers/v24/22-0744.html
http://jmlr.org/papers/volume24/22-0744/22-0744.pdf
2023Martijn Gösgens, Remco van der Hofstad, Nelly Litvak
We introduce a metric space of clusterings, where clusterings are described by a binary vector indexed by the vertex-pairs. We extend this geometry to a hypersphere and prove that maximizing modularity is equivalent to minimizing the angular distance to some modularity vector over the set of clustering vectors. In that sense, modularity-based community detection methods can be seen as a subclass of a more general class of projection methods, which we define as the community detection methods that adhere to the following two-step procedure: first, mapping the network to a point on the hypersphere; second, projecting this point to the set of clustering vectors. We show that this class of projection methods contains many interesting community detection methods. Many of these new methods cannot be described in terms of null models and resolution parameters, as is customary for modularity-based methods. We provide a new characterization of such methods in terms of meridians and latitudes of the hypersphere. In addition, by relating the modularity resolution parameter to the latitude of the corresponding modularity vector, we obtain a new interpretation of the resolution limit that modularity maximization is known to suffer from.
FLIP: A Utility Preserving Privacy Mechanism for Time Series
http://jmlr.org/papers/v24/22-0734.html
http://jmlr.org/papers/volume24/22-0734/22-0734.pdf
2023Tucker McElroy, Anindya Roy, Gaurab Hore
Guaranteeing privacy in released data is an important goal for data-producing agencies. There has been extensive research on developing suitable privacy mechanisms in recent years. Particularly notable is the idea of noise addition with the guarantee of differential privacy. There are, however, concerns about compromising data utility when very stringent privacy mechanisms are applied. Such compromises can be quite stark in correlated data, such as time series data. Adding white noise to a stochastic process may significantly change the correlation structure, a facet of the process that is essential to optimal prediction. We propose the use of all-pass filtering as a privacy mechanism for regularly sampled time series data, showing that this procedure preserves certain types of utility while also providing sufficient privacy guarantees to entity-level time series. Numerical studies explore the practical performance of the new method, and an empirical application to labor force data show the method's favorable utility properties in comparison to other competing privacy mechanisms.
A General Theory for Federated Optimization with Asynchronous and Heterogeneous Clients Updates
http://jmlr.org/papers/v24/22-0689.html
http://jmlr.org/papers/volume24/22-0689/22-0689.pdf
2023Yann Fraboni, Richard Vidal, Laetitia Kameni, Marco Lorenzi
We propose a novel framework to study asynchronous federated learning optimization with delays in gradient updates. Our theoretical framework extends the standard FedAvg aggregation scheme by introducing stochastic aggregation weights to represent the variability of the clients update time, due for example to heterogeneous hardware capabilities. Our formalism applies to the general federated setting where clients have heterogeneous datasets and perform at least one step of stochastic gradient descent (SGD). We demonstrate convergence for such a scheme and provide sufficient conditions for the related minimum to be the optimum of the federated problem. We show that our general framework applies to existing optimization schemes including centralized learning, FedAvg, asynchronous FedAvg, and FedBuff. The theory here provided allows drawing meaningful guidelines for designing a federated learning experiment in heterogeneous conditions. In particular, we develop in this work FedFix, a novel extension of FedAvg enabling efficient asynchronous federated training while preserving the convergence stability of synchronous aggregation. We empirically demonstrate our theory on a series of experiments showing that asynchronous FedAvg leads to fast convergence at the expense of stability, and we finally demonstrate the improvements of FedFix over synchronous and asynchronous FedAvg.
Dimensionless machine learning: Imposing exact units equivariance
http://jmlr.org/papers/v24/22-0680.html
http://jmlr.org/papers/volume24/22-0680/22-0680.pdf
2023Soledad Villar, Weichi Yao, David W. Hogg, Ben Blum-Smith, Bianca Dumitrascu
Units equivariance (or units covariance) is the exact symmetry that follows from the requirement that relationships among measured quantities of physics relevance must obey self-consistent dimensional scalings. Here, we express this symmetry in terms of a (non-compact) group action, and we employ dimensional analysis and ideas from equivariant machine learning to provide a methodology for exactly units-equivariant machine learning: For any given learning task, we first construct a dimensionless version of its inputs using classic results from dimensional analysis and then perform inference in the dimensionless space. Our approach can be used to impose units equivariance across a broad range of machine learning methods that are equivariant to rotations and other groups. We discuss the in-sample and out-of-sample prediction accuracy gains one can obtain in contexts like symbolic regression and emulation, where symmetry is important. We illustrate our approach with simple numerical examples involving dynamical systems in physics and ecology.
Bayesian Calibration of Imperfect Computer Models using Physics-Informed Priors
http://jmlr.org/papers/v24/22-0676.html
http://jmlr.org/papers/volume24/22-0676/22-0676.pdf
2023Michail Spitieris, Ingelin Steinsland
We introduce a computational efficient data-driven framework suitable for quantifying the uncertainty in physical parameters and model formulation of computer models, represented by differential equations. We construct physics-informed priors, which are multi-output GP priors that encode the model's structure in the covariance function. This is extended into a fully Bayesian framework that quantifies the uncertainty of physical parameters and model predictions. Since physical models often are imperfect descriptions of the real process, we allow the model to deviate from the observed data by considering a discrepancy function. For inference Hamiltonian Monte Carlo is used. Further, approximations for big data are developed that reduce the computational complexity from $\mathcal{O}(N^3)$ to $\mathcal{O}(N\cdot m^2),$ where $m \ll N.$ Our approach is demonstrated in simulation and real data case studies where the physics are described by time-dependent ODEs (cardiovascular models) and space-time dependent PDEs (heat equation). In the studies, it is shown that our modelling framework can recover the true parameters of the physical models in cases where 1) the reality is more complex than our modelling choice and 2) the data acquisition process is biased while also producing accurate predictions. Furthermore, it is demonstrated that our approach is computationally faster than traditional Bayesian calibration methods.
Risk Bounds for Positive-Unlabeled Learning Under the Selected At Random Assumption
http://jmlr.org/papers/v24/22-067.html
http://jmlr.org/papers/volume24/22-067/22-067.pdf
2023Olivier Coudray, Christine Keribin, Pascal Massart, Patrick Pamphile
Positive-Unlabeled learning (PU learning) is a special case of semi-supervised binary classification where only a fraction of positive examples is labeled. The challenge is then to find the correct classifier despite this lack of information. Recently, new methodologies have been introduced to address the case where the probability of being labeled may depend on the covariates. In this paper, we are interested in establishing risk bounds for PU learning under this general assumption. In addition, we quantify the impact of label noise on PU learning compared to the standard classification setting. Finally, we provide a lower bound on the minimax risk proving that the upper bound is almost optimal.
Concentration analysis of multivariate elliptic diffusions
http://jmlr.org/papers/v24/22-0666.html
http://jmlr.org/papers/volume24/22-0666/22-0666.pdf
2023Lukas Trottner, Cathrine Aeckerle-Willems, Claudia Strauch
We prove concentration inequalities and associated PAC bounds for both continuous- and discrete-time additive functionals for possibly unbounded functions of multivariate, nonreversible diffusion processes. Our analysis relies on an approach via the Poisson equation allowing us to consider a very broad class of subexponentially ergodic, multivariate diffusion processes. These results add to existing concentration inequalities for additive functionals of diffusion processes which have so far been only available for either bounded functions or for unbounded functions of processes from a significantly smaller class. We demonstrate the power of these exponential inequalities by two examples of very different areas. Considering a possibly high-dimensional, parametric, nonlinear drift model under sparsity constraints we apply the continuous-time concentration results to validate the restricted eigenvalue condition for Lasso estimation, which is fundamental for the derivation of oracle inequalities. The results for discrete additive functionals are applied for an investigation of the unadjusted Langevin MCMC algorithm for sampling of moderately heavy tailed densities $\pi$. In particular, we provide PAC bounds for the sample Monte Carlo estimator of integrals $\pi(f)$ for polynomially growing functions $f$ that quantify sufficient sample and step sizes for approximation within a prescribed margin with high probability.
Knowledge Hypergraph Embedding Meets Relational Algebra
http://jmlr.org/papers/v24/22-063.html
http://jmlr.org/papers/volume24/22-063/22-063.pdf
2023Bahare Fatemi, Perouz Taslakian, David Vazquez, David Poole
Relational databases are a successful model for data storage, and rely on query languages for information retrieval. Most of these query languages are based on relational algebra, a mathematical formalization at the core of relational models. Knowledge graphs are flexible data storage structures that allow for knowledge completion using machine learning techniques. Knowledge hypergraphs generalize knowledge graphs by allowing multi-argument relations. This work studies knowledge hypergraph completion through the lens of relational algebra and its core operations. We explore the space between relational algebra foundations and machine learning techniques for knowledge completion. We investigate whether such methods can capture high-level abstractions in terms of relational algebra operations. We propose a simple embedding-based model called Relational Algebra Embedding (ReAlE) that performs link prediction in knowledge hypergraphs. We show theoretically that ReAlE is fully expressive and can represent the relational algebra operations of renaming, projection, set union, selection, and set difference. We verify experimentally that ReAlE outperforms state-of-the-art models in knowledge hypergraph completion, and in representing each of these primitive relational algebra operations. For the latter experiment, we generate a synthetic knowledge hypergraph, for which we design an algorithm based on the Erdos-R'enyi model for generating random graphs.
Intrinsic Gaussian Process on Unknown Manifolds with Probabilistic Metrics
http://jmlr.org/papers/v24/22-0627.html
http://jmlr.org/papers/volume24/22-0627/22-0627.pdf
2023Mu Niu, Zhenwen Dai, Pokman Cheung, Yizhu Wang
This article presents a novel approach to construct Intrinsic Gaussian Processes for regression on unknown manifolds with probabilistic metrics (GPUM) in point clouds. In many real world applications, one often encounters high dimensional data (e.g.‘point cloud data’) centered around some lower dimensional unknown manifolds. The geometry of manifold is in general different from the usual Euclidean geometry. Naively applying traditional smoothing methods such as Euclidean Gaussian Processes (GPs) to manifold-valued data and so ignoring the geometry of the space can potentially lead to highly misleading predictions and inferences. A manifold embedded in a high dimensional Euclidean space can be well described by a probabilistic mapping function and the corresponding latent space. We investigate the geometrical structure of the unknown manifolds using the Bayesian Gaussian Processes latent variable models(B-GPLVM) and Riemannian geometry. The distribution of the metric tensor is learned using B-GPLVM. The boundary of the resulting manifold is defined based on the uncertainty quantification of the mapping. We use the probabilistic metric tensor to simulate Brownian Motion paths on the unknown manifold. The heat kernel is estimated as the transition density of Brownian Motion and used as the covariance functions of GPUM. The applications of GPUM are illustrated in the simulation studies on the Swiss roll, high dimensional real datasets of WiFi signals and image data examples. Its performance is compared with the Graph Laplacian GP, Graph Mat\'{e}rn GP and Euclidean GP.
Sparse Training with Lipschitz Continuous Loss Functions and a Weighted Group L0-norm Constraint
http://jmlr.org/papers/v24/22-0615.html
http://jmlr.org/papers/volume24/22-0615/22-0615.pdf
2023Michael R. Metel
This paper is motivated by structured sparsity for deep neural network training. We study a weighted group $l_0$-norm constraint, and present the projection and normal cone of this set. Using randomized smoothing, we develop zeroth and first-order algorithms for minimizing a Lipschitz continuous function constrained by any closed set which can be projected onto. Non-asymptotic convergence guarantees are proven in expectation for the proposed algorithms for two related convergence criteria which can be considered as approximate stationary points. Two further methods are given using the proposed algorithms: one with non-asymptotic convergence guarantees in high probability, and the other with asymptotic guarantees to a stationary point almost surely. We believe in particular that these are the first such non-asymptotic convergence results for constrained Lipschitz continuous loss functions.
Learning Optimal Group-structured Individualized Treatment Rules with Many Treatments
http://jmlr.org/papers/v24/22-0520.html
http://jmlr.org/papers/volume24/22-0520/22-0520.pdf
2023Haixu Ma, Donglin Zeng, Yufeng Liu
Data driven individualized decision making problems have received a lot of attentions in recent years. In particular, decision makers aim to determine the optimal Individualized Treatment Rule (ITR) so that the expected specified outcome averaging over heterogeneous patient-specific characteristics is maximized. Many existing methods deal with binary or a moderate number of treatment arms and may not take potential treatment effect structure into account. However, the effectiveness of these methods may deteriorate when the number of treatment arms becomes large. In this article, we propose GRoup Outcome Weighted Learning (GROWL) to estimate the latent structure in the treatment space and the optimal group-structured ITRs through a single optimization. In particular, for estimating group-structured ITRs, we utilize the Reinforced Angle based Multicategory Support Vector Machines (RAMSVM) to learn group-based decision rules under the weighted angle based multi-class classification framework. Fisher consistency, the excess risk bound, and the convergence rate of the value function are established to provide a theoretical guarantee for GROWL. Extensive empirical results in simulation studies and real data analysis demonstrate that GROWL enjoys better performance than several other existing methods.
Inference for Gaussian Processes with Matern Covariogram on Compact Riemannian Manifolds
http://jmlr.org/papers/v24/22-0503.html
http://jmlr.org/papers/volume24/22-0503/22-0503.pdf
2023Didong Li, Wenpin Tang, Sudipto Banerjee
Gaussian processes are widely employed as versatile modelling and predictive tools in spatial statistics, functional data analysis, computer modelling and diverse applications of machine learning. They have been widely studied over Euclidean spaces, where they are specified using covariance functions or covariograms for modelling complex dependencies. There is a growing literature on Gaussian processes over Riemannian manifolds in order to develop richer and more flexible inferential frameworks for non-Euclidean data. While numerical approximations through graph representations have been well studied for the Matern covariogram and heat kernel, the behaviour of asymptotic inference on the parameters of the covariogram has received relatively scant attention. We focus on asymptotic behaviour for Gaussian processes constructed over compact Riemannian manifolds. Building upon a recently introduced Matern covariogram on a compact Riemannian manifold, we employ formal notions and conditions for the equivalence of two Matern Gaussian random measures on compact manifolds to derive the parameter that is identifiable, also known as the microergodic parameter, and formally establish the consistency of the maximum likelihood estimate and the asymptotic optimality of the best linear unbiased predictor. The circle is studied as a specific example of compact Riemannian manifolds with numerical experiments to illustrate and corroborate the theory.
FedLab: A Flexible Federated Learning Framework
http://jmlr.org/papers/v24/22-0440.html
http://jmlr.org/papers/volume24/22-0440/22-0440.pdf
2023Dun Zeng, Siqi Liang, Xiangjing Hu, Hui Wang, Zenglin Xu
FedLab is a lightweight open-source framework for the simulation of federated learning. The design of FedLab focuses on federated learning algorithm effectiveness and communication efficiency. It allows customization on server optimization, client optimization, communication agreement, and communication compression. Also, FedLab is scalable in different deployment scenarios with different computation and communication resources. We hope FedLab could provide flexible APIs as well as reliable baseline implementations and relieve the burden of implementing novel approaches for researchers in the FL community. The source code, tutorial, and documentation can be found at https://github.com/SMILELab-FL/FedLab.
Connectivity Matters: Neural Network Pruning Through the Lens of Effective Sparsity
http://jmlr.org/papers/v24/22-0415.html
http://jmlr.org/papers/volume24/22-0415/22-0415.pdf
2023Artem Vysogorets, Julia Kempe
Neural network pruning is a fruitful area of research with surging interest in high sparsity regimes. Benchmarking in this domain heavily relies on faithful representation of the sparsity of subnetworks, which has been traditionally computed as the fraction of removed connections (direct sparsity). This definition, however, fails to recognize unpruned parameters that detached from input or output layers of the underlying subnetworks, potentially underestimating actual effective sparsity: the fraction of inactivated connections. While this effect might be negligible for moderately pruned networks (up to 10–100 compression rates), we find that it plays an increasing role for sparser subnetworks, greatly distorting comparison between different pruning algorithms. For example, we show that effective compression of a randomly pruned LeNet-300-100 can be orders of magnitude larger than its direct counterpart, while no discrepancy is ever observed when using SynFlow for pruning (Tanaka et al., 2020). In this work, we adopt the lens of effective sparsity to reevaluate several recent pruning algorithms on common benchmark architectures (e.g., LeNet-300-100, VGG-19, ResNet-18) and discover that their absolute and relative performance changes dramatically in this new, and as we argue, more appropriate framework. To aim for effective, rather than direct, sparsity, we develop a low-cost extension to most pruning algorithms. Further, equipped with effective sparsity as a reference frame, we partially reconfirm that random pruning with appropriate sparsity allocation across layers performs as well or better than more sophisticated algorithms for pruning at initialization (Su et al., 2020). In response to this observation, using an analogy of pressure distribution in coupled cylinders from thermodynamics, we design novel layerwise sparsity quotas that outperform all existing baselines in the context of random pruning.
An Analysis of Robustness of Non-Lipschitz Networks
http://jmlr.org/papers/v24/22-0410.html
http://jmlr.org/papers/volume24/22-0410/22-0410.pdf
2023Maria-Florina Balcan, Avrim Blum, Dravyansh Sharma, Hongyang Zhang
Despite significant advances, deep networks remain highly susceptible to adversarial attack. One fundamental challenge is that small input perturbations can often produce large movements in the network’s final-layer feature space. In this paper, we define an attack model that abstracts this challenge, to help understand its intrinsic properties. In our model, the adversary may move data an arbitrary distance in feature space but only in random low-dimensional subspaces. We prove such adversaries can be quite powerful: defeating any algorithm that must classify any input it is given. However, by allowing the algorithm to abstain on unusual inputs, we show such adversaries can be overcome when classes are reasonably well-separated in feature space. We further provide strong theoretical guarantees for setting algorithm parameters to optimize over accuracy-abstention trade-offs using data-driven methods. Our results provide new robustness guarantees for nearest-neighbor style algorithms, and also have application to contrastive learning, where we empirically demonstrate the ability of such algorithms to obtain high robust accuracy with low abstention rates. Our model is also motivated by strategic classification, where entities being classified aim to manipulate their observable features to produce a preferred classification, and we provide new insights into that area as well.
Fitting Autoregressive Graph Generative Models through Maximum Likelihood Estimation
http://jmlr.org/papers/v24/22-0337.html
http://jmlr.org/papers/volume24/22-0337/22-0337.pdf
2023Xu Han, Xiaohui Chen, Francisco J. R. Ruiz, Li-Ping Liu
We consider the problem of fitting autoregressive graph generative models via maximum likelihood estimation (MLE). MLE is intractable for graph autoregressive models because the nodes in a graph can be arbitrarily reordered; thus the exact likelihood involves a sum over all possible node orders leading to the same graph. In this work, we fit the graph models by maximizing a variational bound, which is built by first deriving the joint probability over the graph and the node order of the autoregressive process. This approach avoids the need to specify ad-hoc node orders, since an inference network learns the most likely node sequences that have generated a given graph. We improve the approach by developing a graph generative model based on attention mechanisms and an inference network based on routing search. We demonstrate empirically that fitting autoregressive graph models via variational inference improves their qualitative and quantitative performance, and the improved model and inference network further boost the performance.
Global Convergence of Sub-gradient Method for Robust Matrix Recovery: Small Initialization, Noisy Measurements, and Over-parameterization
http://jmlr.org/papers/v24/22-0233.html
http://jmlr.org/papers/volume24/22-0233/22-0233.pdf
2023Jianhao Ma, Salar Fattahi
In this work, we study the performance of sub-gradient method (SubGM) on a natural nonconvex and nonsmooth formulation of low-rank matrix recovery with $\ell_1$-loss, where the goal is to recover a low-rank matrix from a limited number of measurements, a subset of which may be grossly corrupted with noise. We study a scenario where the rank of the true solution is unknown and over-estimated instead. The over-estimation of the rank gives rise to an over-parameterized model in which there are more degrees of freedom than needed. Such over-parameterization may lead to overfitting, or adversely affect the performance of the algorithm. We prove that a simple SubGM with small initialization is agnostic to both over-parameterization and noise in the measurements. In particular, we show that small initialization nullifies the effect of over-parameterization on the performance of SubGM, leading to an exponential improvement in its convergence rate. Moreover, we provide the first unifying framework for analyzing the behavior of SubGM under both outlier and Gaussian noise models, showing that SubGM converges to the true solution, even under arbitrarily large and arbitrarily dense noise values, and, perhaps surprisingly, even if the globally optimal solutions do not correspond to the ground truth. At the core of our results is a robust variant of restricted isometry property, called Sign-RIP, which controls the deviation of the sub-differential of the $\ell_1$-loss from that of an ideal, expected loss. As a byproduct of our results, we consider a subclass of robust low-rank matrix recovery with Gaussian measurements, and show that the number of required samples to guarantee the global convergence of SubGM is independent of the over-parameterized rank.
Statistical Inference for Noisy Incomplete Binary Matrix
http://jmlr.org/papers/v24/22-0214.html
http://jmlr.org/papers/volume24/22-0214/22-0214.pdf
2023Yunxiao Chen, Chengcheng Li, Jing Ouyang, Gongjun Xu
We consider the statistical inference for noisy incomplete binary (or 1-bit) matrix. Despite the importance of uncertainty quantification to matrix completion, most of the categorical matrix completion literature focuses on point estimation and prediction. This paper moves one step further toward statistical inference for binary matrix completion. Under a popular nonlinear factor analysis model, we obtain a point estimator and derive its asymptotic normality. Moreover, our analysis adopts a flexible missing-entry design that does not require a random sampling scheme as required by most of the existing asymptotic results for matrix completion. Under reasonable conditions, the proposed estimator is statistically efficient and optimal in the sense that the Cramer-Rao lower bound is achieved asymptotically for the model parameters. Two applications are considered, including (1) linking two forms of an educational test and (2) linking the roll call voting records from multiple years in the United States Senate. The first application enables the comparison between examinees who took different test forms, and the second application allows us to compare the liberal-conservativeness of senators who did not serve in the Senate at the same time.
Faith-Shap: The Faithful Shapley Interaction Index
http://jmlr.org/papers/v24/22-0202.html
http://jmlr.org/papers/volume24/22-0202/22-0202.pdf
2023Che-Ping Tsai, Chih-Kuan Yeh, Pradeep Ravikumar
Shapley values, which were originally designed to assign attributions to individual players in coalition games, have become a commonly used approach in explainable machine learning to provide attributions to input features for black-box machine learning models. A key attraction of Shapley values is that they uniquely satisfy a very natural set of axiomatic properties. However, extending the Shapley value to assigning attributions to interactions rather than individual players, an interaction index, is non-trivial: as the natural set of axioms for the original Shapley values, extended to the context of interactions, no longer specify a unique interaction index. Many proposals thus introduce additional possibly stringent axioms, while sacrificing the key axiom of efficiency, in order to obtain unique interaction indices. In this work, rather than introduce additional conflicting axioms, we adopt the viewpoint of Shapley values as coefficients of the most faithful linear approximation to the pseudo-Boolean coalition game value function. By extending linear to higher order polynomial approximations, we can then define the general family of faithful interaction indices. We show that by additionally requiring the faithful interaction indices to satisfy interaction-extensions of the standard individual Shapley axioms (dummy, symmetry, linearity, and efficiency), we obtain a unique Faithful Shapley Interaction index, which we denote Faith-Shap, as a natural generalization of the Shapley value to interactions. We then provide some illustrative contrasts of Faith-Shap with previously proposed interaction indices, and further investigate some of its interesting algebraic properties. We further show the computational efficiency of computing Faith-Shap, together with some additional qualitative insights, via some illustrative experiments.
Decentralized Learning: Theoretical Optimality and Practical Improvements
http://jmlr.org/papers/v24/22-0044.html
http://jmlr.org/papers/volume24/22-0044/22-0044.pdf
2023Yucheng Lu, Christopher De Sa
Decentralization is a promising method of scaling up parallel machine learning systems. In this paper, we provide a tight lower bound on the iteration complexity for such methods in a stochastic non-convex setting. Our lower bound reveals a theoretical gap in known convergence rates of many existing decentralized training algorithms, such as D-PSGD. We prove by construction this lower bound is tight and achievable. Motivated by our insights, we further propose DeTAG, a practical gossip-style decentralized algorithm that achieves the lower bound with only a logarithm gap. While a simple version of DeTAG with plain SGD and constant step size suffice for achieving theoretical limits, we additionally provide convergence bound for DeTAG under general non-increasing step size and momentum. Empirically, we compare DeTAG with other decentralized algorithms on multiple vision benchmarks, including CIFAR10/100 and ImageNet. We substantiate our theory and show DeTAG converges faster on unshuffled data and in sparse networks. Furthermore, we study a DeTAG variant, DeTAG*, that practically speeds up data-center-scale model training. This manuscript provides extended contents to its ICML version.
Non-Asymptotic Guarantees for Robust Statistical Learning under Infinite Variance Assumption
http://jmlr.org/papers/v24/22-0034.html
http://jmlr.org/papers/volume24/22-0034/22-0034.pdf
2023Lihu Xu, Fang Yao, Qiuran Yao, Huiming Zhang
There has been a surge of interest in developing robust estimators for models with heavy-tailed and bounded variance data in statistics and machine learning, while few works impose unbounded variance. This paper proposes two types of robust estimators, the ridge log-truncated M-estimator and the elastic net log-truncated M-estimator. The first estimator is applied to convex regressions such as quantile regression and generalized linear models, while the other one is applied to high dimensional non-convex learning problems such as regressions via deep neural networks. Simulations and real data analysis demonstrate the robustness of log-truncated estimations over standard estimations.
Recursive Quantile Estimation: Non-Asymptotic Confidence Bounds
http://jmlr.org/papers/v24/22-0021.html
http://jmlr.org/papers/volume24/22-0021/22-0021.pdf
2023Likai Chen, Georg Keilbar, Wei Biao Wu
This paper considers the recursive estimation of quantiles using the stochastic gradient descent (SGD) algorithm with Polyak-Ruppert averaging. The algorithm offers a computationally and memory efficient alternative to the usual empirical estimator. Our focus is on studying the non-asymptotic behavior by providing exponentially decreasing tail probability bounds under mild assumptions on the smoothness of the density functions. This novel non-asymptotic result is based on a bound of the moment generating function of the SGD estimate. We apply our result to the problem of best arm identification in a multi-armed stochastic bandit setting under quantile preferences.
Outlier-Robust Subsampling Techniques for Persistent Homology
http://jmlr.org/papers/v24/21-1526.html
http://jmlr.org/papers/volume24/21-1526/21-1526.pdf
2023Bernadette J. Stolz
In recent years, persistent homology has been successfully applied to real-world data in many different settings. Despite significant computational advances, persistent homology algorithms do not yet scale to large datasets preventing interesting applications. One approach to address computational issues posed by persistent homology is to select a set of landmarks by subsampling from the data. Currently, these landmark points are chosen either at random or using the maxmin algorithm. Neither is ideal as random selection tends to favour dense areas of the data while the maxmin algorithm is very sensitive to noise. Here, we propose a novel approach to select landmarks specifically for persistent homology that preserves coarse topological information of the original dataset. Our method is motivated by the Mayer-Vietoris sequence and requires only local persistent homology calculations thus enabling efficient computation. We test our landmarks on artificial data sets which contain different levels of noise and compare them to standard landmark selection techniques. We demonstrate that our landmark selection outperforms standard methods as well as a subsampling technique based on an outlier-robust version of the k-means algorithm for low sampling densities in noisy data with respect to robustness to outliers.
Neural Operator: Learning Maps Between Function Spaces With Applications to PDEs
http://jmlr.org/papers/v24/21-1524.html
http://jmlr.org/papers/volume24/21-1524/21-1524.pdf
2023Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, Anima Anandkumar
The classical development of neural networks has primarily focused on learning mappings between finite dimensional Euclidean spaces or finite sets. We propose a generalization of neural networks to learn operators, termed neural operators, that map between infinite dimensional function spaces. We formulate the neural operator as a composition of linear integral operators and nonlinear activation functions. We prove a universal approximation theorem for our proposed neural operator, showing that it can approximate any given nonlinear continuous operator. The proposed neural operators are also discretization-invariant, i.e., they share the same model parameters among different discretization of the underlying function spaces. Furthermore, we introduce four classes of efficient parameterization, viz., graph neural operators, multi-pole graph neural operators, low-rank neural operators, and Fourier neural operators. An important application for neural operators is learning surrogate maps for the solution operators of partial differential equations (PDEs). We consider standard PDEs such as the Burgers, Darcy subsurface flow, and the Navier-Stokes equations, and show that the proposed neural operators have superior performance compared to existing machine learning based methodologies, while being several orders of magnitude faster than conventional PDE solvers.
Dimension-Grouped Mixed Membership Models for Multivariate Categorical Data
http://jmlr.org/papers/v24/21-1513.html
http://jmlr.org/papers/volume24/21-1513/21-1513.pdf
2023Yuqi Gu, Elena E. Erosheva, Gongjun Xu, David B. Dunson
Mixed Membership Models (MMMs) are a popular family of latent structure models for complex multivariate data. Instead of forcing each subject to belong to a single cluster, MMMs incorporate a vector of subject-specific weights characterizing partial membership across clusters. With this flexibility come challenges in uniquely identifying, estimating, and interpreting the parameters. In this article, we propose a new class of Dimension-Grouped MMMs (Gro-M$^3$s) for multivariate categorical data, which improve parsimony and interpretability. In Gro-M$^3$s, observed variables are partitioned into groups such that the latent membership is constant for variables within a group but can differ across groups. Traditional latent class models are obtained when all variables are in one group, while traditional MMMs are obtained when each variable is in its own group. The new model corresponds to a novel decomposition of probability tensors. Theoretically, we derive transparent identifiability conditions for both the unknown grouping structure and model parameters in general settings. Methodologically, we propose a Bayesian approach for Dirichlet Gro-M$^3$s to inferring the variable grouping structure and estimating model parameters. Simulation results demonstrate good computational performance and empirically confirm the identifiability results. We illustrate the new methodology through applications to a functional disability survey dataset and a personality test dataset.
Gaussian Processes with Errors in Variables: Theory and Computation
http://jmlr.org/papers/v24/21-1480.html
http://jmlr.org/papers/volume24/21-1480/21-1480.pdf
2023Shuang Zhou, Debdeep Pati, Tianying Wang, Yun Yang, Raymond J. Carroll
Covariate measurement error in nonparametric regression is a common problem in nutritional epidemiology and geostatistics, and other fields. Over the last two decades, this problem has received substantial attention in the frequentist literature. Bayesian approaches for handling measurement error have only been explored recently and are surprisingly successful, although there still is a lack of a proper theoretical justification regarding the asymptotic performance of the estimators. By specifying a Gaussian process prior on the regression function and a Dirichlet process Gaussian mixture prior on the unknown distribution of the unobserved covariates, we show that the posterior distribution of the regression function and the unknown covariate density attain optimal rates of contraction adaptively over a range of Holder classes, up to logarithmic terms. We also develop a novel surrogate prior for approximating the Gaussian process prior that leads to efficient computation and preserves the covariance structure, thereby facilitating easy prior elicitation. We demonstrate the empirical performance of our approach and compare it with competitors in a wide range of simulation experiments and a real data example.
Learning Partial Differential Equations in Reproducing Kernel Hilbert Spaces
http://jmlr.org/papers/v24/21-1363.html
http://jmlr.org/papers/volume24/21-1363/21-1363.pdf
2023George Stepaniants
We propose a new data-driven approach for learning the fundamental solutions (Green's functions) of various linear partial differential equations (PDEs) given sample pairs of input-output functions. Building off the theory of functional linear regression (FLR), we estimate the best-fit Green's function and bias term of the fundamental solution in a reproducing kernel Hilbert space (RKHS) which allows us to regularize their smoothness and impose various structural constraints. We derive a general representer theorem for operator RKHSs to approximate the original infinite-dimensional regression problem by a finite-dimensional one, reducing the search space to a parametric class of Green's functions. In order to study the prediction error of our Green's function estimator, we extend prior results on FLR with scalar outputs to the case with functional outputs. Finally, we demonstrate our method on several linear PDEs including the Poisson, Helmholtz, Schrödinger, Fokker-Planck, and heat equation. We highlight its robustness to noise as well as its ability to generalize to new data with varying degrees of smoothness and mesh discretization without any additional training.
Doubly Robust Stein-Kernelized Monte Carlo Estimator: Simultaneous Bias-Variance Reduction and Supercanonical Convergence
http://jmlr.org/papers/v24/21-1313.html
http://jmlr.org/papers/volume24/21-1313/21-1313.pdf
2023Henry Lam, Haofeng Zhang
Standard Monte Carlo computation is widely known to exhibit a canonical square-root convergence speed in terms of sample size. Two recent techniques, one based on control variate and one on importance sampling, both derived from an integration of reproducing kernels and Stein's identity, have been proposed to reduce the error in Monte Carlo computation to supercanonical convergence. This paper presents a more general framework to encompass both techniques that is especially beneficial when the sample generator is biased and noise-corrupted. We show our general estimator, which we call the doubly robust Stein-kernelized estimator, outperforms both existing methods in terms of mean squared error rates across different scenarios. We also demonstrate the superior performance of our method via numerical examples.
Online Optimization over Riemannian Manifolds
http://jmlr.org/papers/v24/21-1308.html
http://jmlr.org/papers/volume24/21-1308/21-1308.pdf
2023Xi Wang, Zhipeng Tu, Yiguang Hong, Yingyi Wu, Guodong Shi
Online optimization has witnessed a massive surge of research attention in recent years. In this paper, we propose online gradient descent and online bandit algorithms over Riemannian manifolds in full information and bandit feedback settings respectively, for both geodesically convex and strongly geodesically convex functions. We establish a series of upper bounds on the regrets for the proposed algorithms over Hadamard manifolds. We also find a universal lower bound for achievable regret on Hadamard manifolds. Our analysis shows how time horizon, dimension, and sectional curvature bounds have impact on the regret bounds. When the manifold permits positive sectional curvature, we prove similar regret bound can be established by handling non-constrictive project maps. In addition, numerical studies on problems defined on symmetric positive definite matrix manifold, hyperbolic spaces, and Grassmann manifolds are provided to validate our theoretical findings, using synthetic and real-world data.
Bayes-Newton Methods for Approximate Bayesian Inference with PSD Guarantees
http://jmlr.org/papers/v24/21-1298.html
http://jmlr.org/papers/volume24/21-1298/21-1298.pdf
2023William J. Wilkinson, Simo Särkkä, Arno Solin
We formulate natural gradient variational inference (VI), expectation propagation (EP), and posterior linearisation (PL) as extensions of Newton's method for optimising the parameters of a Bayesian posterior distribution. This viewpoint explicitly casts inference algorithms under the framework of numerical optimisation. We show that common approximations to Newton's method from the optimisation literature, namely Gauss-Newton and quasi-Newton methods (e.g., the BFGS algorithm), are still valid under this 'Bayes-Newton' framework. This leads to a suite of novel algorithms which are guaranteed to result in positive semi-definite (PSD) covariance matrices, unlike standard VI and EP. Our unifying viewpoint provides new insights into the connections between various inference schemes. All the presented methods apply to any model with a Gaussian prior and non-conjugate likelihood, which we demonstrate with (sparse) Gaussian processes and state space models.
Iterated Block Particle Filter for High-dimensional Parameter Learning: Beating the Curse of Dimensionality
http://jmlr.org/papers/v24/21-1253.html
http://jmlr.org/papers/volume24/21-1253/21-1253.pdf
2023Ning Ning, Edward L. Ionides
Parameter learning for high-dimensional, partially observed, and nonlinear stochastic processes is a methodological challenge. Spatiotemporal disease transmission systems provide examples of such processes giving rise to open inference problems. We propose the iterated block particle filter (IBPF) algorithm for learning high-dimensional parameters over graphical state space models with general state spaces, measures, transition densities and graph structure. Theoretical performance guarantees are obtained on beating the curse of dimensionality (COD), algorithm convergence, and likelihood maximization. Experiments on a highly nonlinear and non-Gaussian spatiotemporal model for measles transmission reveal that the iterated ensemble Kalman filter algorithm (Li et al., 2020) is ineffective and the iterated filtering algorithm (Ionides et al., 2015) suffers from the COD, while our IBPF algorithm beats COD consistently across various experiments with different metrics.
Fast Online Changepoint Detection via Functional Pruning CUSUM Statistics
http://jmlr.org/papers/v24/21-1230.html
http://jmlr.org/papers/volume24/21-1230/21-1230.pdf
2023Gaetano Romano, Idris A. Eckley, Paul Fearnhead, Guillem Rigaill
Many modern applications of online changepoint detection require the ability to process high-frequency observations, sometimes with limited available computational resources. Online algorithms for detecting a change in mean often involve using a moving window, or specifying the expected size of change. Such choices affect which changes the algorithms have most power to detect. We introduce an algorithm, Functional Online CuSUM (FOCuS), which is equivalent to running these earlier methods simultaneously for all sizes of windows, or all possible values for the size of change. Our theoretical results give tight bounds on the expected computational cost per iteration of FOCuS, with this being logarithmic in the number of observations. We show how FOCuS can be applied to a number of different changes in mean scenarios, and demonstrate its practical utility through its state-of-the-art performance at detecting anomalous behaviour in computer server data.
Temporal Abstraction in Reinforcement Learning with the Successor Representation
http://jmlr.org/papers/v24/21-1213.html
http://jmlr.org/papers/volume24/21-1213/21-1213.pdf
2023Marlos C. Machado, Andre Barreto, Doina Precup, Michael Bowling
Reasoning at multiple levels of temporal abstraction is one of the key attributes of intelligence. In reinforcement learning, this is often modeled through temporally extended courses of actions called options. Options allow agents to make predictions and to operate at different levels of abstraction within an environment. Nevertheless, approaches based on the options framework often start with the assumption that a reasonable set of options is known beforehand. When this is not the case, there are no definitive answers for which options one should consider. In this paper, we argue that the successor representation, which encodes states based on the pattern of state visitation that follows them, can be seen as a natural substrate for the discovery and use of temporal abstractions. To support our claim, we take a big picture view of recent results, showing how the successor representation can be used to discover options that facilitate either temporally-extended exploration or planning. We cast these results as instantiations of a general framework for option discovery in which the agent’s representation is used to identify useful options, which are then used to further improve its representation. This results in a virtuous, never-ending, cycle in which both the representation and the options are constantly refined based on each other. Beyond option discovery itself, we also discuss how the successor representation allows us to augment a set of options into a combinatorially large counterpart without additional learning. This is achieved through the combination of previously learned options. Our empirical evaluation focuses on options discovered for temporally-extended exploration and on the use of the successor representation to combine them. Our results shed light on important design decisions involved in the definition of options and demonstrate the synergy of different methods based on the successor representation, such as eigenoptions and the option keyboard.
Approximate Post-Selective Inference for Regression with the Group LASSO
http://jmlr.org/papers/v24/21-1186.html
http://jmlr.org/papers/volume24/21-1186/21-1186.pdf
2023Snigdha Panigrahi, Peter W MacDonald, Daniel Kessler
After selection with the Group LASSO (or generalized variants such as the overlapping, sparse, or standardized Group LASSO), inference for the selected parameters is unreliable in the absence of adjustments for selection bias. In the penalized Gaussian regression setup, existing approaches provide adjustments for selection events that can be expressed as linear inequalities in the data variables. Such a representation, however, fails to hold for selection with the Group LASSO and substantially obstructs the scope of subsequent post-selective inference. Key questions of inferential interest, e.g., inference for the effects of selected variables on the outcome, remain unanswered. In the present paper, we develop a consistent, post-selective, Bayesian method to address the existing gaps by deriving a likelihood adjustment factor and an approximation thereof that eliminates bias from the selection of groups. Experiments on simulated data and data from the Human Connectome Project demonstrate that our method recovers the effects of parameters within the selected groups while paying only a small price for bias adjustment.
Towards Learning to Imitate from a Single Video Demonstration
http://jmlr.org/papers/v24/21-1174.html
http://jmlr.org/papers/volume24/21-1174/21-1174.pdf
2023Glen Berseth, Florian Golemo, Christopher Pal
Agents that can learn to imitate behaviours observed in video -- without having direct access to internal state or action information of the observed agent -- are more suitable for learning in the natural world. However, formulating a reinforcement learning (RL) agent that facilitates this goal remains a significant challenge. We approach this challenge using contrastive training to learn a reward function by comparing an agent's behaviour with a single demonstration. We use a Siamese recurrent neural network architecture to learn rewards in space and time between motion clips while training an RL policy to minimize this distance. Through experimentation, we also find that the inclusion of multi-task data and additional image encoding losses improve the temporal consistency of the learned rewards and, as a result, significantly improve policy learning. We demonstrate our approach on simulated humanoid, dog, and raptor agents in 2D and quadruped and humanoid agents in 3D. We show that our method outperforms current state-of-the-art techniques and can learn to imitate behaviours from a single video demonstration.
A Likelihood Approach to Nonparametric Estimation of a Singular Distribution Using Deep Generative Models
http://jmlr.org/papers/v24/21-1099.html
http://jmlr.org/papers/volume24/21-1099/21-1099.pdf
2023Minwoo Chae, Dongha Kim, Yongdai Kim, Lizhen Lin
We investigate statistical properties of a likelihood approach to nonparametric estimation of a singular distribution using deep generative models. More specifically, a deep generative model is used to model high-dimensional data that are assumed to concentrate around some low-dimensional structure. Estimating the distribution supported on this low-dimensional structure, such as a low-dimensional manifold, is challenging due to its singularity with respect to the Lebesgue measure in the ambient space. In the considered model, a usual likelihood approach can fail to estimate the target distribution consistently due to the singularity. We prove that a novel and effective solution exists by perturbing the data with an instance noise, which leads to consistent estimation of the underlying distribution with desirable convergence rates. We also characterize the class of distributions that can be efficiently estimated via deep generative models. This class is sufficiently general to contain various structured distributions such as product distributions, classically smooth distributions and distributions supported on a low-dimensional manifold. Our analysis provides some insights on how deep generative models can avoid the curse of dimensionality for nonparametric distribution estimation. We conduct a thorough simulation study and real data analysis to empirically demonstrate that the proposed data perturbation technique improves the estimation performance significantly.
A Randomized Subspace-based Approach for Dimensionality Reduction and Important Variable Selection
http://jmlr.org/papers/v24/21-1046.html
http://jmlr.org/papers/volume24/21-1046/21-1046.pdf
2023Di Bo, Hoon Hwangbo, Vinit Sharma, Corey Arndt, Stephanie TerMaath
An analysis of high-dimensional data can offer a detailed description of a system but is often challenged by the curse of dimensionality. General dimensionality reduction techniques can alleviate such difficulty by extracting a few important features, but they are limited due to the lack of interpretability and connectivity to actual decision making associated with each physical variable. Variable selection techniques, as an alternative, can maintain the interpretability, but they often involve a greedy search that is susceptible to failure in capturing important interactions or a metaheuristic search that requires extensive computations. This research proposes a novel method that identifies critical subspaces, reduced-dimensional physical spaces, to achieve dimensionality reduction and variable selection. We apply a randomized search for subspace exploration and leverage ensemble techniques to enhance model performance. When applied to high-dimensional data collected from the failure prediction of a composite/metal hybrid structure exhibiting complex progressive damage failure under loading, the proposed method outperforms the existing and potential alternatives in prediction and important variable selection.
Intrinsic Persistent Homology via Density-based Metric Learning
http://jmlr.org/papers/v24/21-1044.html
http://jmlr.org/papers/volume24/21-1044/21-1044.pdf
2023Ximena Fernández, Eugenio Borghini, Gabriel Mindlin, Pablo Groisman
We address the problem of estimating topological features from data in high dimensional Euclidean spaces under the manifold assumption. Our approach is based on the computation of persistent homology of the space of data points endowed with a sample metric known as Fermat distance. We prove that such metric space converges almost surely to the manifold itself endowed with an intrinsic metric that accounts for both the geometry of the manifold and the density that produces the sample. This fact implies the convergence of the associated persistence diagrams. The use of this intrinsic distance when computing persistent homology presents advantageous properties such as robustness to the presence of outliers in the input data and less sensitiveness to the particular embedding of the underlying manifold in the ambient space. We use these ideas to propose and implement a method for pattern recognition and anomaly detection in time series, which is evaluated in applications to real data.
Privacy-Aware Rejection Sampling
http://jmlr.org/papers/v24/21-0870.html
http://jmlr.org/papers/volume24/21-0870/21-0870.pdf
2023Jordan Awan, Vinayak Rao
While differential privacy (DP) offers strong theoretical privacy guarantees, implementations of DP mechanisms may be vulnerable to side-channel attacks, such as timing attacks. When sampling methods such as MCMC or rejection sampling are used to implement a privacy mechanism, the runtime can leak private information. We characterize the additional privacy cost due to the runtime of a rejection sampler in terms of both $(\epsilon,\delta)$-DP as well as $f$-DP. We also show that unless the acceptance probability is constant across databases, the runtime of a rejection sampler does not satisfy $\epsilon$-DP for any $\epsilon$. We show that there is a similar breakdown in privacy with adaptive rejection samplers. We propose three modifications to the rejection sampling algorithm, with varying assumptions, to protect against timing attacks by making the runtime independent of the data. The modification with the weakest assumptions is an approximate sampler, introducing a small increase in the privacy cost, whereas the other modifications give perfect samplers. We also use our techniques to develop an adaptive rejection sampler for log-Hölder densities, which also has data-independent runtime. We give several examples of DP mechanisms that fit the assumptions of our methods and can thus be implemented using our samplers.
Inference for a Large Directed Acyclic Graph with Unspecified Interventions
http://jmlr.org/papers/v24/21-0855.html
http://jmlr.org/papers/volume24/21-0855/21-0855.pdf
2023Chunlin Li, Xiaotong Shen, Wei Pan
Statistical inference of directed relations given some unspecified interventions (i.e., the intervention targets are unknown) is challenging. In this article, we test hypothesized directed relations with unspecified interventions. First, we derive conditions to yield an identifiable model. Unlike classical inference, testing directed relations requires identifying the ancestors and relevant interventions of hypothesis-specific primary variables. To this end, we propose a peeling algorithm based on nodewise regressions to establish a topological order of primary variables. Moreover, we prove that the peeling algorithm yields a consistent estimator in low-order polynomial time. Second, we propose a likelihood ratio test integrated with a data perturbation scheme to account for the uncertainty of identifying ancestors and interventions. Also, we show that the distribution of a data perturbation test statistic converges to the target distribution. Numerical examples demonstrate the utility and effectiveness of the proposed methods, including an application to infer gene regulatory networks.
How Do You Want Your Greedy: Simultaneous or Repeated?
http://jmlr.org/papers/v24/21-0782.html
http://jmlr.org/papers/volume24/21-0782/21-0782.pdf
2023Moran Feldman, Christopher Harshaw, Amin Karbasi
We present SimulatneousGreedys, a deterministic algorithm for constrained submodular maximization. At a high level, the algorithm maintains $\ell$ solutions and greedily updates them in a simultaneous fashion. SimultaneousGreedys achieves the tightest known approximation guarantees for both $k$-extendible systems and the more general $k$-systems, which are $(k+1)^2/k = k + \mathcal{O}(1)$ and $(1 + \sqrt{k+2})^2 = k + \mathcal{O}(\sqrt{k})$, respectively. We also improve the analysis of RepeatedGreedy, showing that it achieves an approximation ratio of $k + \mathcal{O}(\sqrt{k})$ for $k$-systems when allowed to run for $\mathcal{O}(\sqrt{k})$ iterations, an improvement in both the runtime and approximation over previous analyses. We demonstrate that both algorithms may be modified to run in nearly linear time with an arbitrarily small loss in the approximation. Both SimultaneousGreedys and RepeatedGreedy are flexible enough to incorporate the intersection of $m$ additional knapsack constraints, while retaining similar approximation guarantees: both algorithms yield an approximation guarantee of roughly $k + 2m + \mathcal{O}(\sqrt{k+m})$ for $k$-systems and SimultaneousGreedys enjoys an improved approximation guarantee of $k+2m + \mathcal{O}(\sqrt{m})$ for $k$-extendible systems. To complement our algorithmic contributions, we prove that no algorithm making polynomially many oracle queries can achieve an approximation better than $k + 1/2 - \epsilon$. We also present SubmodularGreedy.jl, a Julia package which implements these algorithms. Finally, we test these algorithms on real datasets.
Kernel-Matrix Determinant Estimates from stopped Cholesky Decomposition
http://jmlr.org/papers/v24/21-0781.html
http://jmlr.org/papers/volume24/21-0781/21-0781.pdf
2023Simon Bartels, Wouter Boomsma, Jes Frellsen, Damien Garreau
Algorithms involving Gaussian processes or determinantal point processes typically require computing the determinant of a kernel matrix. Frequently, the latter is computed from the Cholesky decomposition, an algorithm of cubic complexity in the size of the matrix. We show that, under mild assumptions, it is possible to estimate the determinant from only a sub-matrix, with probabilistic guarantee on the relative error. We present an augmentation of the Cholesky decomposition that stops under certain conditions before processing the whole matrix. Experiments demonstrate that this can save a considerable amount of time while rarely exceeding an overhead of more than 5% when not stopping early. More generally, we present a probabilistic stopping strategy for the approximation of a sum of known length where addends are revealed sequentially. We do not assume independence between addends, only that they are bounded from below and decrease in conditional expectation.
Optimizing ROC Curves with a Sort-Based Surrogate Loss for Binary Classification and Changepoint Detection
http://jmlr.org/papers/v24/21-0751.html
http://jmlr.org/papers/volume24/21-0751/21-0751.pdf
2023Jonathan Hillman, Toby Dylan Hocking
Receiver Operating Characteristic (ROC) curves are useful for evaluating binary classification models, but difficult to use for learning since the Area Under the Curve (AUC) is a piecewise constant function of predicted values. ROC curves can also be used in other problems with false positive and true positive rates such as changepoint detection. We show that in this more general context, the ROC curve can have loops, points with highly sub-optimal error rates, and AUC greater than one. This observation motivates a new optimization objective: rather than maximizing the AUC, we would like a monotonic ROC curve with AUC=1 that avoids points with large values for Min(FP,FN). We propose an L1 relaxation of this objective that results in a new surrogate loss function called the AUM, short for Area Under Min(FP, FN). Whereas previous loss functions are based on summing over all labeled examples or pairs, the AUM requires a sort and a sum over the sequence of points on the ROC curve. We show that AUM directional derivatives can be efficiently computed and used in a gradient descent learning algorithm. In our empirical study of supervised binary classification and changepoint detection problems, we show that our new AUM minimization learning algorithm results in improved AUC and speed relative to previous baselines.
When Locally Linear Embedding Hits Boundary
http://jmlr.org/papers/v24/21-0697.html
http://jmlr.org/papers/volume24/21-0697/21-0697.pdf
2023Hau-Tieng Wu, Nan Wu
Based on the Riemannian manifold model, we study the asymptotic behavior of a widely applied unsupervised learning algorithm, locally linear embedding (LLE), when the point cloud is sampled from a compact, smooth manifold with boundary. We show several peculiar behaviors of LLE near the boundary that are different from those diffusion-based algorithms. In particular, we show that LLE pointwisely converges to a mixed-type differential operator with degeneracy and we calculate the convergence rate. The impact of the hyperbolic part of the operator is discussed and we propose a clipped LLE algorithm which is a potential approach to recover the Dirichlet Laplace-Beltrami operator.
Distributed Nonparametric Regression Imputation for Missing Response Problems with Large-scale Data
http://jmlr.org/papers/v24/21-0673.html
http://jmlr.org/papers/volume24/21-0673/21-0673.pdf
2023Ruoyu Wang, Miaomiao Su, Qihua Wang
Nonparametric regression imputation is commonly used in missing data analysis. However, it suffers from the curse of dimension. The problem can be alleviated by the explosive sample size in the era of big data, while the large-scale data size presents some challenges in the storage of data and the calculation of estimators. These challenges make the classical nonparametric regression imputation methods no longer applicable. This motivates us to develop two distributed nonparametric regression imputation methods. One is based on kernel smoothing and the other on the sieve method. The kernel-based distributed imputation method has extremely low communication cost, and the sieve-based distributed imputation method can accommodate more local machines. The response mean estimation is considered to illustrate the proposed imputation methods. Two distributed nonparametric regression imputation estimators are proposed for the response mean, which are proved to be asymptotically normal with asymptotic variances achieving the semiparametric efficiency bound. The proposed methods are evaluated through simulation studies and illustrated in real data analysis.
Prior Specification for Bayesian Matrix Factorization via Prior Predictive Matching
http://jmlr.org/papers/v24/21-0623.html
http://jmlr.org/papers/volume24/21-0623/21-0623.pdf
2023Eliezer de Souza da Silva, Tomasz Kuśmierczyk, Marcelo Hartmann, Arto Klami
The behavior of many Bayesian models used in machine learning critically depends on the choice of prior distributions, controlled by some hyperparameters typically selected through Bayesian optimization or cross-validation. This requires repeated, costly, posterior inference. We provide an alternative for selecting good priors without carrying out posterior inference, building on the prior predictive distribution that marginalizes the model parameters. We estimate virtual statistics for data generated by the prior predictive distribution and then optimize over the hyperparameters to learn those for which the virtual statistics match the target values provided by the user or estimated from (a subset of) the observed data. We apply the principle for probabilistic matrix factorization, for which good solutions for prior selection have been missing. We show that for Poisson factorization models we can analytically determine the hyperparameters, including the number of factors, that best replicate the target statistics, and we empirically study the sensitivity of the approach for the model mismatch. We also present a model-independent procedure that determines the hyperparameters for general models by stochastic optimization and demonstrate this extension in the context of hierarchical matrix factorization models.
Posterior Contraction for Deep Gaussian Process Priors
http://jmlr.org/papers/v24/21-0556.html
http://jmlr.org/papers/volume24/21-0556/21-0556.pdf
2023Gianluca Finocchio, Johannes Schmidt-Hieber
We study posterior contraction rates for a class of deep Gaussian process priors in the nonparametric regression setting under a general composition assumption on the regression function. It is shown that the contraction rates can achieve the minimax convergence rate (up to log n factors), while being adaptive to the underlying structure and smoothness of the target function. The proposed framework extends the Bayesian nonparametric theory for Gaussian process priors.
Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule
http://jmlr.org/papers/v24/21-0549.html
http://jmlr.org/papers/volume24/21-0549/21-0549.pdf
2023Nikhil Iyer, V. Thejas, Nipun Kwatra, Ramachandran Ramjee, Muthian Sivathanu
Several papers argue that wide minima generalize better than narrow minima. In this paper, through detailed experiments that not only corroborate the generalization properties of wide minima, we also provide empirical evidence for a new hypothesis that the density of wide minima is likely lower than the density of narrow minima. Further, motivated by this hypothesis, we design a novel explore-exploit learning rate schedule. On a variety of image and natural language datasets, compared to their original hand-tuned learning rate baselines, we show that our explore-exploit schedule can result in either up to 0.84% higher absolute accuracy using the original training budget or up to 57% reduced training time while achieving the original reported accuracy.
Fundamental limits and algorithms for sparse linear regression with sublinear sparsity
http://jmlr.org/papers/v24/21-0543.html
http://jmlr.org/papers/volume24/21-0543/21-0543.pdf
2023Lan V. Truong
We establish exact asymptotic expressions for the normalized mutual information and minimum mean-square-error (MMSE) of sparse linear regression in the sub-linear sparsity regime. Our result is achieved by a generalization of the adaptive interpolation method in Bayesian inference for linear regimes to sub-linear ones. A modification of the well-known approximate message passing algorithm to approach the MMSE fundamental limit is also proposed, and its state evolution is rigorously analysed. Our results show that the traditional linear assumption between the signal dimension and number of observations in the replica and adaptive interpolation methods is not necessary for sparse signals. They also show how to modify the existing well-known AMP algorithms for linear regimes to sub-linear ones.
On the Complexity of SHAP-Score-Based Explanations: Tractability via Knowledge Compilation and Non-Approximability Results
http://jmlr.org/papers/v24/21-0389.html
http://jmlr.org/papers/volume24/21-0389/21-0389.pdf
2023Marcelo Arenas, Pablo Barcelo, Leopoldo Bertossi, Mikael Monet
Scores based on Shapley values are widely used for providing explanations to classification results over machine learning models. A prime example of this is the influential~ Shap-score, a version of the Shapley value that can help explain the result of a learned model on a specific entity by assigning a score to every feature. While in general computing Shapley values is a computationally intractable problem, we prove a strong positive result stating that the Shap-score can be computed in polynomial time over deterministic and decomposable Boolean circuits under the so-called product distributions on entities. Such circuits are studied in the field of Knowledge Compilation and generalize a wide range of Boolean circuits and binary decision diagrams classes, including binary decision trees, Ordered Binary Decision Diagrams (OBDDs) and Free Binary Decision Diagrams (FBDDs). Our positive result extends even beyond binary classifiers, as it continues to hold if each feature is associated with a finite domain of possible values. We also establish the computational limits of the notion of Shap-score by observing that, under a mild condition, computing it over a class of Boolean models is always polynomially as hard as the model counting problem for that class. This implies that both determinism and decomposability are essential properties for the circuits that we consider, as removing one or the other renders the problem of computing the Shap-score intractable (namely, $\#P$-hard). It also implies that computing Shap-scores is $\#P$-hard even over the class of propositional formulas in DNF. Based on this negative result, we look for the existence of fully-polynomial randomized approximation schemes (FPRAS) for computing Shap-scores over such class. In stark contrast to the model counting problem for DNF formulas, which admits an FPRAS, we prove that no such FPRAS exists (under widely believed complexity assumptions) for the computation of Shap-scores. Surprisingly, this negative result holds even for the class of monotone formulas in DNF. These techniques can be further extended to prove another strong negative result: Under widely believed complexity assumptions, there is no polynomial-time algorithm that checks, given a monotone DNF formula $\varphi$ and features $x,y$, whether the Shap-score of $x$ in $\varphi$ is smaller than the Shap-score of $y$ in $\varphi$.
Monotonic Alpha-divergence Minimisation for Variational Inference
http://jmlr.org/papers/v24/21-0249.html
http://jmlr.org/papers/volume24/21-0249/21-0249.pdf
2023Kamélia Daudel, Randal Douc, François Roueff
In this paper, we introduce a novel family of iterative algorithms which carry out $\alpha$-divergence minimisation in a Variational Inference context. They do so by ensuring a systematic decrease at each step in the $\alpha$-divergence between the variational and the posterior distributions. In its most general form, the variational distribution is a mixture model and our framework allows us to simultaneously optimise the weights and components parameters of this mixture model. Our approach permits us to build on various methods previously proposed for $\alpha$-divergence minimisation such as Gradient or Power Descent schemes and we also shed a new light on an integrated Expectation Maximization algorithm. Lastly, we provide empirical evidence that our methodology yields improved results on several multimodal target distributions and on a real data example.
Density estimation on low-dimensional manifolds: an inflation-deflation approach
http://jmlr.org/papers/v24/21-0235.html
http://jmlr.org/papers/volume24/21-0235/21-0235.pdf
2023Christian Horvat, Jean-Pascal Pfister
Normalizing flows (NFs) are universal density estimators based on neural networks. However, this universality is limited: the density's support needs to be diffeomorphic to a Euclidean space. In this paper, we propose a novel method to overcome this limitation without sacrificing universality. The proposed method inflates the data manifold by adding noise in the normal space, trains an NF on this inflated manifold, and, finally, deflates the learned density. Our main result provides sufficient conditions on the manifold and the specific choice of noise under which the corresponding estimator is exact. Our method has the same computational complexity as NFs and does not require computing an inverse flow. We also demonstrate theoretically (under certain conditions) and empirically (on a wide range of toy examples) that noise in the normal space can be well approximated by Gaussian noise. This allows using our method for approximating arbitrary densities on unknown manifolds provided that the manifold dimension is known.
Provably Sample-Efficient Model-Free Algorithm for MDPs with Peak Constraints
http://jmlr.org/papers/v24/21-0117.html
http://jmlr.org/papers/volume24/21-0117/21-0117.pdf
2023Qinbo Bai, Vaneet Aggarwal, Ather Gattami
In the optimization of dynamic systems, the variables typically have constraints. Such problems can be modeled as a Constrained Markov Decision Process (CMDP). This paper considers the peak Constrained Markov Decision Process (PCMDP), where the agent chooses the policy to maximize total reward in the finite horizon as well as satisfy constraints at each epoch with probability 1. We propose a model-free algorithm that converts PCMDP problem to an unconstrained problem and a Q-learning based approach is applied. We define the concept of probably approximately correct (PAC) to the proposed PCMDP problem. The proposed algorithm is proved to achieve an $(\epsilon,p)$-PAC policy when the episode $K\geq\Omega(\frac{I^2H^6SA\ell}{\epsilon^2})$, where $S$ and $A$ are the number of states and actions, respectively. $H$ is the number of epochs per episode. $I$ is the number of constraint functions, and $\ell=\log(\frac{SAT}{p})$. We note that this is the first result on PAC kind of analysis for PCMDP with peak constraints, where the transition dynamics are not known apriori. We demonstrate the proposed algorithm on an energy harvesting problem and a single machine scheduling problem, where it performs close to the theoretical upper bound of the studied optimization problem.
Topological Convolutional Layers for Deep Learning
http://jmlr.org/papers/v24/21-0073.html
http://jmlr.org/papers/volume24/21-0073/21-0073.pdf
2023Ephy R. Love, Benjamin Filippenko, Vasileios Maroulas, Gunnar Carlsson
This work introduces the Topological CNN (TCNN), which encompasses several topologically defined convolutional methods. Manifolds with important relationships to the natural image space are used to parameterize image filters which are used as convolutional weights in a TCNN. These manifolds also parameterize slices in layers of a TCNN across which the weights are localized. We show evidence that TCNNs learn faster, on less data, with fewer learned parameters, and with greater generalizability and interpretability than conventional CNNs. We introduce and explore TCNN layers for both image and video data. We propose extensions to 3D images and 3D video.
Online Stochastic Gradient Descent with Arbitrary Initialization Solves Non-smooth, Non-convex Phase Retrieval
http://jmlr.org/papers/v24/20-902.html
http://jmlr.org/papers/volume24/20-902/20-902.pdf
2023Yan Shuo Tan, Roman Vershynin
In recent literature, a general two step procedure has been formulated for solving the problem of phase retrieval. First, a spectral technique is used to obtain a constant-error initial estimate, following which, the estimate is refined to arbitrary precision by first-order optimization of a non-convex loss function. Numerical experiments, however, seem to suggest that simply running the iterative schemes from a random initialization may also lead to convergence, albeit at the cost of slightly higher sample complexity. In this paper, we prove that, in fact, constant step size online stochastic gradient descent (SGD) converges from arbitrary initializations for the non-smooth, non-convex amplitude squared loss objective. In this setting, online SGD is also equivalent to the randomized Kaczmarz algorithm from numerical analysis. Our analysis can easily be generalized to other single index models. It also makes use of new ideas from stochastic process theory, including the notion of a summary state space, which we believe will be of use for the broader field of non-convex optimization.
Tree-AMP: Compositional Inference with Tree Approximate Message Passing
http://jmlr.org/papers/v24/20-695.html
http://jmlr.org/papers/volume24/20-695/20-695.pdf
2023Antoine Baker, Florent Krzakala, Benjamin Aubin, Lenka Zdeborová
We introduce Tree-AMP, standing for Tree Approximate Message Passing, a python package for compositional inference in high-dimensional tree-structured models. The package provides a unifying framework to study several approximate message passing algorithms previously derived for a variety of machine learning tasks such as generalized linear models, inference in multi-layer networks, matrix factorization, and reconstruction using non-separable penalties. For some models, the asymptotic performance of the algorithm can be theoretically predicted by the state evolution, and the measurements entropy estimated by the free entropy formalism. The implementation is modular by design: each module, which implements a factor, can be composed at will with other modules to solve complex inference tasks. The user only needs to declare the factor graph of the model: the inference algorithm, state evolution and entropy estimation are fully automated.
On the geometry of Stein variational gradient descent
http://jmlr.org/papers/v24/20-602.html
http://jmlr.org/papers/volume24/20-602/20-602.pdf
2023Andrew Duncan, Nikolas Nüsken, Lukasz Szpruch
Bayesian inference problems require sampling or approximating high-dimensional probability distributions. The focus of this paper is on the recently introduced Stein variational gradient descent methodology, a class of algorithms that rely on iterated steepest descent steps with respect to a reproducing kernel Hilbert space norm. This construction leads to interacting particle systems, the mean field limit of which is a gradient flow on the space of probability distributions equipped with a certain geometrical structure. We leverage this viewpoint to shed some light on the convergence properties of the algorithm, in particular addressing the problem of choosing a suitable positive definite kernel function. Our analysis leads us to considering certain nondifferentiable kernels with adjusted tails. We demonstrate significant performance gains of these in various numerical experiments.
Kernel-based estimation for partially functional linear model: Minimax rates and randomized sketches
http://jmlr.org/papers/v24/20-1461.html
http://jmlr.org/papers/volume24/20-1461/20-1461.pdf
2023Shaogao Lv, Xin He, Junhui Wang
This paper considers the partially functional linear model (PFLM) where all predictive features consist of a functional covariate and a high dimensional scalar vector. Over an infinite dimensional reproducing kernel Hilbert space, the proposed estimation for PFLM is a least square approach with two mixed regularizations of a function-norm and an $\ell_1$-norm. Our main task in this paper is to establish the minimax rates for PFLM under high dimensional setting, and the optimal minimax rates of estimation are established by using various techniques in empirical process theory for analyzing kernel classes. In addition, we propose an efficient numerical algorithm based on randomized sketches of the kernel matrix. Several numerical experiments are implemented to support our method and optimization strategy.
Contextual Stochastic Block Model: Sharp Thresholds and Contiguity
http://jmlr.org/papers/v24/20-1419.html
http://jmlr.org/papers/volume24/20-1419/20-1419.pdf
2023Chen Lu, Subhabrata Sen
We study community detection in the “contextual stochastic block model" (Yan and Sarkar (2020), Deshpande et al. (2018)). Deshpande et al. (2018) studied this problem in the setting of sparse graphs with high-dimensional node-covariates. Using the non-rigorous “cavity method" from statistical physics (Mezard and Montanari (2009)), they calculated the sharp limit for community detection in this setting, and verified that the limit matches the information theoretic threshold when the average degree of the observed graph is large. They conjectured that the limit should hold as soon as the average degree exceeds one. We establish this conjecture, and characterize the sharp threshold for detection and weak recovery.
VCG Mechanism Design with Unknown Agent Values under Stochastic Bandit Feedback
http://jmlr.org/papers/v24/20-1226.html
http://jmlr.org/papers/volume24/20-1226/20-1226.pdf
2023Kirthevasan Kandasamy, Joseph E Gonzalez, Michael I Jordan, Ion Stoica
We study a multi-round welfare-maximising mechanism design problem in instances where agents do not know their values. On each round, a mechanism first assigns an allocation to a set of agents and charges them a price; at the end of the round, the agents provide (stochastic) feedback to the mechanism for the allocation they received. This setting is motivated by applications in cloud markets and online advertising where an agent may know her value for an allocation only after experiencing it. Therefore, the mechanism needs to explore different allocations for each agent so that it can learn their values, while simultaneously attempting to find the socially optimal set of allocations. Our focus is on truthful and individually rational mechanisms which imitate the classical VCG mechanism in the long run. To that end, we first define three notions of regret for the welfare, the individual utilities of each agent and that of the mechanism. We show that these three terms are interdependent via an $\Omega(T^{\frac{2}{3}})$ lower bound for the maximum of these three terms after $T$ rounds of allocations, and describe an algorithm which essentially achieves this rate. Our framework also provides flexibility to control the pricing scheme so as to trade-off between the agent and seller regrets. Next, we define asymptotic variants for the truthfulness and individual rationality requirements and provide asymptotic rates to quantify the degree to which both properties are satisfied by the proposed algorithm.
Necessary and Sufficient Conditions for Inverse Reinforcement Learning of Bayesian Stopping Time Problems
http://jmlr.org/papers/v24/20-1202.html
http://jmlr.org/papers/volume24/20-1202/20-1202.pdf
2023Kunal Pattanayak, Vikram Krishnamurthy
This paper presents an inverse reinforcement learning (IRL) framework for Bayesian stopping time problems. By observing the actions of a Bayesian decision maker, we provide a necessary and sufficient condition to identify if these actions are consistent with optimizing a cost function. In a Bayesian (partially observed) setting, the inverse learner can at best identify optimality wrt the observed strategies. Our IRL algorithm identifies optimality and then constructs set-valued estimates of the cost function. To achieve this IRL objective, we use novel ideas from Bayesian revealed preferences stemming from microeconomics. We illustrate the proposed IRL scheme using two important examples of stopping time problems, namely, sequential hypothesis testing and Bayesian search. As a real-world example, we illustrate using a YouTube dataset comprising metadata from 190000 videos how the proposed IRL method predicts user engagement in online multimedia platforms with high accuracy. Finally, for finite datasets, we propose an IRL detection algorithm and give finite sample bounds on its error probabilities.
Online Change-Point Detection in High-Dimensional Covariance Structure with Application to Dynamic Networks
http://jmlr.org/papers/v24/20-1101.html
http://jmlr.org/papers/volume24/20-1101/20-1101.pdf
2023Lingjun Li, Jun Li
In this paper, we develop an online change-point detection procedure in the covariance structure of high-dimensional data. A new stopping rule is proposed to terminate the process as early as possible when a change in covariance structure occurs. The stopping rule allows spatial and temporal dependence and can be applied to non-Gaussian data. An explicit expression for the average run length is derived, so that the level of threshold in the stopping rule can be easily obtained with no need to run time-consuming Monte Carlo simulations. We also establish an upper bound for the expected detection delay, the expression of which demonstrates the impact of data dependence and magnitude of change in the covariance structure. Simulation studies are provided to confirm accuracy of the theoretical results. The practical usefulness of the proposed procedure is illustrated by detecting the change of brain’s covariance network in a resting-state fMRI data set. The implementation of the methodology is provided in the R package OnlineCOV.
Convergence Rates of a Class of Multivariate Density Estimation Methods Based on Adaptive Partitioning
http://jmlr.org/papers/v24/20-060.html
http://jmlr.org/papers/volume24/20-060/20-060.pdf
2023Linxi Liu, Dangna Li, Wing Hung Wong
Density estimation is a building block for many other statistical methods, such as classification, nonparametric testing, and data compression. In this paper, we focus on a non-parametric approach to multivariate density estimation, and study its asymptotic properties under both frequentist and Bayesian settings. The estimated density function is obtained by considering a sequence of approximating spaces to the space of densities. These spaces consist of piecewise constant density functions supported by binary partitions with increasing complexity. To obtain an estimate, the partition is learned by maximizing either the likelihood of the corresponding histogram on that partition, or the marginal posterior probability of the partition under a suitable prior. We analyze the convergence rate of the maximum likelihood estimator and the posterior concentration rate of the Bayesian estimator, and conclude that for a relatively rich class of density functions the rate does not directly depend on the dimension. We also show that the Bayesian method can adapt to the unknown smoothness of the density function. The method is applied to several specific function classes and explicit rates are obtained. These include spatially sparse functions, functions of bounded variation, and H{\"o}lder continuous functions. We also introduce an ensemble approach, obtained by aggregating multiple density estimates fit under carefully designed perturbations, and show that for density functions lying in a H{\"o}lder space ($\mathcal{H}^{1, \beta}, 0 < \beta \leq 1$), the ensemble method can achieve minimax convergence rate up to a logarithmic term, while the corresponding rate of the density estimator based on a single partition is suboptimal for this function class.
Reinforcement Learning for Joint Optimization of Multiple Rewards
http://jmlr.org/papers/v24/19-980.html
http://jmlr.org/papers/volume24/19-980/19-980.pdf
2023Mridul Agarwal, Vaneet Aggarwal
Finding optimal policies which maximize long term rewards of Markov Decision Processes requires the use of dynamic programming and backward induction to solve the Bellman optimality equation. However, many real-world problems require optimization of an objective that is non-linear in cumulative rewards for which dynamic programming cannot be applied directly. For example, in a resource allocation problem, one of the objectives is to maximize long-term fairness among the users. We notice that when an agent aim to optimize some function of the sum of rewards is considered, the problem loses its Markov nature. This paper addresses and formalizes the problem of optimizing a non-linear function of the long term average of rewards. We propose model-based and model-free algorithms to learn the policy, where the model-based policy is shown to achieve a regret of $\Tilde{O}\left(LKDS\sqrt{\frac{A}{T}}\right)$ for $K$ objectives combined with a concave $L$-Lipschitz function. Further, using the fairness in cellular base-station scheduling, and queueing system scheduling as examples, the proposed algorithm is shown to significantly outperform the conventional RL approaches.
On the Convergence of Stochastic Gradient Descent with Bandwidth-based Step Size
http://jmlr.org/papers/v24/19-1009.html
http://jmlr.org/papers/volume24/19-1009/19-1009.pdf
2023Xiaoyu Wang, Ya-xiang Yuan
We first propose a general step-size framework for the stochastic gradient descent(SGD) method: bandwidth-based step sizes that are allowed to vary within a banded region. The framework provides efficient and flexible step size selection in optimization, including cyclical and non-monotonic step sizes (e.g., triangular policy and cosine with restart), for which theoretical guarantees are rare. We provide state-of-the-art convergence guarantees for SGD under mild conditions and allow a large constant step size at the beginning of training. Moreover, we investigate the error bounds of SGD under the bandwidth step size where the boundary functions are in the same order and different orders, respectively. Finally, we propose a $1/t$ up-down policy and design novel non-monotonic step sizes. Numerical experiments demonstrate these bandwidth-based step sizes' efficiency and significant potential in training regularized logistic regression and several large-scale neural network tasks.
A Group-Theoretic Approach to Computational Abstraction: Symmetry-Driven Hierarchical Clustering
http://jmlr.org/papers/v24/18-508.html
http://jmlr.org/papers/volume24/18-508/18-508.pdf
2023Haizi Yu, Igor Mineyev, Lav R. Varshney
Humans' abstraction ability plays a key role in concept learning and knowledge discovery. This theory paper presents the mathematical formulation for computationally emulating human-like abstractions---computational abstraction---and abstraction processes developed hierarchically from innate priors like symmetries. We study the nature of abstraction via a group-theoretic approach, formalizing and practically computing abstractions as symmetry-driven hierarchical clustering. Compared to data-driven clustering like k-means or agglomerative clustering (a chain), our abstraction model is data-free, feature-free, similarity-free, and globally hierarchical (a lattice). This paper also serves as a theoretical generalization of several existing works. These include generalizing Shannon's information lattice, specialized algorithms for certain symmetry-induced clusterings, as well as formalizing knowledge discovery applications such as learning music theory from scores and chemistry laws from molecules. We consider computational abstraction as a first step towards a principled and cognitive way of achieving human-level concept learning and knowledge discovery.
The d-Separation Criterion in Categorical Probability
http://jmlr.org/papers/v24/22-0916.html
http://jmlr.org/papers/volume24/22-0916/22-0916.pdf
2023Tobias Fritz, Andreas Klingler
The d-separation criterion detects the compatibility of a joint probability distribution with a directed acyclic graph through certain conditional independences. In this work, we study this problem in the context of categorical probability theory by introducing a categorical definition of causal models, a categorical notion of d-separation, and proving an abstract version of the d-separation criterion. This approach has two main benefits. First, categorical d-separation is a very intuitive criterion based on topological connectedness. Second, our results apply both to measure-theoretic probability (with standard Borel spaces) and beyond probability theory, including to deterministic and possibilistic networks. It therefore provides a clean proof of the equivalence of local and global Markov properties with causal compatibility for continuous and mixed random variables as well as deterministic and possibilistic variables.
The multimarginal optimal transport formulation of adversarial multiclass classification
http://jmlr.org/papers/v24/22-0698.html
http://jmlr.org/papers/volume24/22-0698/22-0698.pdf
2023Nicolás García Trillos, Matt Jacobs, Jakwang Kim
We study a family of adversarial multiclass classification problems and provide equivalent reformulations in terms of: 1) a family of generalized barycenter problems introduced in the paper and 2) a family of multimarginal optimal transport problems where the number of marginals is equal to the number of classes in the original classification problem. These new theoretical results reveal a rich geometric structure of adversarial learning problems in multiclass classification and extend recent results restricted to the binary classification setting. A direct computational implication of our results is that by solving either the barycenter problem and its dual, or the MOT problem and its dual, we can recover the optimal robust classification rule and the optimal adversarial strategy for the original adversarial problem. Examples with synthetic and real data illustrate our results.
Robust Load Balancing with Machine Learned Advice
http://jmlr.org/papers/v24/22-0629.html
http://jmlr.org/papers/volume24/22-0629/22-0629.pdf
2023Sara Ahmadian, Hossein Esfandiari, Vahab Mirrokni, Binghui Peng
Motivated by the exploding growth of web-based services and the importance of efficiently managing the computational resources of such systems, we introduce and study a theoretical model for load balancing of very large databases such as commercial search engines. Our model is a more realistic version of the well-received \bab model with an additional constraint that limits the number of servers that carry each piece of the data. This additional constraint is necessary when, on one hand, the data is so large that we can not copy the whole data on each server. On the other hand, the query response time is so limited that we can not ignore the fact that the number of queries for each piece of the data changes over time, and hence we can not simply split the data over different machines. In this paper, we develop an almost optimal load balancing algorithm that works given an estimate of the load of each piece of the data. Our algorithm is almost perfectly robust to wrong estimates, to the extent that even when all of the loads are adversarially chosen the performance of our algorithm is $1-1/e$, which is provably optimal. Along the way, we develop various techniques for analyzing the balls-into-bins process under certain correlations and build a novel connection with the multiplicative weights update scheme.
Benchmarking Graph Neural Networks
http://jmlr.org/papers/v24/22-0567.html
http://jmlr.org/papers/volume24/22-0567/22-0567.pdf
2023Vijay Prakash Dwivedi, Chaitanya K. Joshi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, Xavier Bresson
In the last few years, graph neural networks (GNNs) have become the standard toolkit for analyzing and learning from data on graphs. This emerging field has witnessed an extensive growth of promising techniques that have been applied with success to computer science, mathematics, biology, physics and chemistry. But for any successful field to become mainstream and reliable, benchmarks must be developed to quantify progress. This led us in March 2020 to release a benchmark framework that i) comprises of a diverse collection of mathematical and real-world graphs, ii) enables fair model comparison with the same parameter budget to identify key architectures, iii) has an open-source, easy-to use and reproducible code infrastructure, and iv) is flexible for researchers to experiment with new theoretical ideas. As of December 2022, the GitHub repository has reached 2,000 stars and 380 forks, which demonstrates the utility of the proposed open-source framework through the wide usage by the GNN community. In this paper, we present an updated version of our benchmark with a concise presentation of the aforementioned framework characteristics, an additional medium-sized molecular dataset AQSOL, similar to the popular ZINC, but with a real-world measured chemical target, and discuss how this framework can be leveraged to explore new GNN designs and insights. As a proof of value of our benchmark, we study the case of graph positional encoding (PE) in GNNs, which was introduced with this benchmark and has since spurred interest of exploring more powerful PE for Transformers and GNNs in a robust experimental setting.
A Simple Approach to Improve Single-Model Deep Uncertainty via Distance-Awareness
http://jmlr.org/papers/v24/22-0479.html
http://jmlr.org/papers/volume24/22-0479/22-0479.pdf
2023Jeremiah Zhe Liu, Shreyas Padhy, Jie Ren, Zi Lin, Yeming Wen, Ghassen Jerfel, Zachary Nado, Jasper Snoek, Dustin Tran, Balaji Lakshminarayanan
Accurate uncertainty quantification is a major challenge in deep learning, as neural networks can make overconfident errors and assign high confidence predictions to out-of-distribution (OOD) inputs. The most popular approaches to estimate predictive uncertainty in deep learning are methods that combine predictions from multiple neural networks, such as Bayesian neural networks (BNNs) and deep ensembles. However their practicality in real-time, industrial-scale applications are limited due to the high memory and computational cost. Furthermore, ensembles and BNNs do not necessarily fix all the issues with the underlying member networks. In this work, we study principled approaches to improve the uncertainty property of a single network, based on a single, deterministic representation. By formalizing the uncertainty quantification as a minimax learning problem, we first identify distance awareness, i.e., the model's ability to quantify the distance of a testing example from the training data, as a necessary condition for a DNN to achieve high-quality (i.e., minimax optimal) uncertainty estimation. We then propose Spectral-normalized Neural Gaussian Process (SNGP), a simple method that improves the distance-awareness ability of modern DNNs with two simple changes: (1) applying spectral normalization to hidden weights to enforce bi-Lipschitz smoothness in representations and (2) replacing the last output layer with a Gaussian process layer. On a suite of vision and language understanding benchmarks and on modern architectures (Wide-ResNet and BERT), SNGP consistently outperforms other single-model approaches in prediction, calibration and out-of-domain detection. Furthermore, SNGP provides complementary benefits to popular techniques such as deep ensembles and data augmentation, making it a simple and scalable building block for probabilistic deep learning.
Neural Implicit Flow: a mesh-agnostic dimensionality reduction paradigm of spatio-temporal data
http://jmlr.org/papers/v24/22-0365.html
http://jmlr.org/papers/volume24/22-0365/22-0365.pdf
2023Shaowu Pan, Steven L. Brunton, J. Nathan Kutz
High-dimensional spatio-temporal dynamics can often be encoded in a low-dimensional subspace. Engineering applications for modeling, characterization, design, and control of such large-scale systems often rely on dimensionality reduction to make solutions computationally tractable in real time. Common existing paradigms for dimensionality reduction include linear methods, such as the singular value decomposition (SVD), and nonlinear methods, such as variants of convolutional autoencoders (CAE). However, these encoding techniques lack the ability to efficiently represent the complexity associated with spatio-temporal data, which often requires variable geometry, non-uniform grid resolution, adaptive meshing, and/or parametric dependencies. To resolve these practical engineering challenges, we propose a general framework called Neural Implicit Flow (NIF) that enables a mesh-agnostic, low-rank representation of large-scale, parametric, spatial-temporal data. NIF consists of two modified multilayer perceptrons (MLPs): (i) ShapeNet, which isolates and represents the spatial complexity, and (ii) ParameterNet, which accounts for any other input complexity, including parametric dependencies, time, and sensor measurements. We demonstrate the utility of NIF for parametric surrogate modeling, enabling the interpretable representation and compression of complex spatio-temporal dynamics, efficient many-spatial-query tasks, and improved generalization performance for sparse reconstruction.
On Batch Teaching Without Collusion
http://jmlr.org/papers/v24/22-0330.html
http://jmlr.org/papers/volume24/22-0330/22-0330.pdf
2023Shaun Fallat, David Kirkpatrick, Hans U. Simon, Abolghasem Soltani, Sandra Zilles
Formal models of learning from teachers need to respect certain criteria to avoid collusion. The most commonly accepted notion of collusion-avoidance was proposed by Goldman and Mathias (1996), and various teaching models obeying their criterion have been studied. For each model $M$ and each concept class $\mathcal{C}$, a parameter $M$-TD$(\mathcal{C})$ refers to the teaching dimension of concept class $\mathcal{C}$ in model $M$---defined to be the number of examples required for teaching a concept, in the worst case over all concepts in $\mathcal{C}$. This paper introduces a new model of teaching, called no-clash teaching, together with the corresponding parameter NCTD$(\mathcal{C})$. No-clash teaching is provably optimal in the strong sense that, given any concept class $\mathcal{C}$ and any model $M$ obeying Goldman and Mathias's collusion-avoidance criterion, one obtains NCTD$(\mathcal{C})\le M$-TD$(\mathcal{C})$. We also study a corresponding notion NCTD$^+$ for the case of learning from positive data only, establish useful bounds on NCTD and NCTD$^+$, and discuss relations of these parameters to other complexity parameters of interest in computational learning theory. We further argue that Goldman and Mathias's collusion-avoidance criterion may in some settings be too weak in that it admits certain forms of interaction between teacher and learner that could be considered collusion in practice. Therefore, we introduce a strictly stronger notion of collusion-avoidance and demonstrate that the well-studied notion of Preference-based Teaching is optimal among all teaching schemes that are strongly collusion-avoiding on all finite subsets of a given concept class.
Sensing Theorems for Unsupervised Learning in Linear Inverse Problems
http://jmlr.org/papers/v24/22-0315.html
http://jmlr.org/papers/volume24/22-0315/22-0315.pdf
2023Julián Tachella, Dongdong Chen, Mike Davies
Solving an ill-posed linear inverse problem requires knowledge about the underlying signal model. In many applications, this model is a priori unknown and has to be learned from data. However, it is impossible to learn the model using observations obtained via a single incomplete measurement operator, as there is no information about the signal model in the nullspace of the operator, resulting in a chicken-and-egg problem: to learn the model we need reconstructed signals, but to reconstruct the signals we need to know the model. Two ways to overcome this limitation are using multiple measurement operators or assuming that the signal model is invariant to a certain group action. In this paper, we present necessary and sufficient sensing conditions for learning the signal model from measurement data alone which only depend on the dimension of the model and the number of operators or properties of the group action that the model is invariant to. As our results are agnostic of the learning algorithm, they shed light into the fundamental limitations of learning from incomplete data and have implications in a wide range set of practical algorithms, such as dictionary learning, matrix completion and deep neural networks.
First-Order Algorithms for Nonlinear Generalized Nash Equilibrium Problems
http://jmlr.org/papers/v24/22-0310.html
http://jmlr.org/papers/volume24/22-0310/22-0310.pdf
2023Michael I. Jordan, Tianyi Lin, Manolis Zampetakis
We consider the problem of computing an equilibrium in a class of nonlinear generalized Nash equilibrium problems (NGNEPs) in which the strategy sets for each player are defined by the equality and inequality constraints that may depend on the choices of rival players. While the asymptotic global convergence and local convergence rate of certain algorithms have been extensively investigated, the iteration complexity analysis is still in its infancy. This paper provides two first-order algorithms based on quadratic penalty method (QPM) and augmented Lagrangian method (ALM), respectively, with an accelerated mirror-prox algorithm as the solver in each inner loop. We show the nonasymptotic convergence rate for these algorithms. In particular, we establish the global convergence guarantee for solving monotone and strongly monotone NGNEPs and provide the complexity bounds expressed in terms of the number of gradient evaluations. Experimental results demonstrate the efficiency of our algorithms in practice.
Ridges, Neural Networks, and the Radon Transform
http://jmlr.org/papers/v24/22-0227.html
http://jmlr.org/papers/volume24/22-0227/22-0227.pdf
2023Michael Unser
A ridge is a function that is characterized by a one-dimensional profile (activation) and a multidimensional direction vector. Ridges appear in the theory of neural networks as functional descriptors of the effect of a neuron, with the direction vector being encoded in the linear weights. In this paper, we investigate properties of the Radon transform in relation to ridges and to the characterization of neural networks. We introduce a broad category of hyper-spherical Banach subspaces (including the relevant subspace of measures) over which the back-projection operator is invertible. We also give conditions under which the back-projection operator is extendable to the full parent space with its null space being identifiable as a Banach complement. Starting from first principles, we then characterize the sampling functionals that are in the range of the filtered Radon transform. Next, we extend the definition of ridges for any distributional profile and determine their (filtered) Radon transform in full generality. Finally, we apply our formalism to clarify and simplify some of the results and proofs on the optimality of ReLU networks that have appeared in the literature.
Label Distribution Changing Learning with Sample Space Expanding
http://jmlr.org/papers/v24/22-0210.html
http://jmlr.org/papers/volume24/22-0210/22-0210.pdf
2023Chao Xu, Hong Tao, Jing Zhang, Dewen Hu, Chenping Hou
With the evolution of data collection ways, label ambiguity has arisen from various applications. How to reduce its uncertainty and leverage its effectiveness is still a challenging task. As two types of representative label ambiguities, Label Distribution Learning (LDL), which annotates each instance with a label distribution, and Emerging New Class (ENC), which focuses on model reusing with new classes, have attached extensive attentions. Nevertheless, in many applications, such as emotion distribution recognition and facial age estimation, we may face a more complicated label ambiguity scenario, i.e., label distribution changing with sample space expanding owing to the new class. To solve this crucial but rarely studied problem, we propose a new framework named as Label Distribution Changing Learning (LDCL) in this paper, together with its theoretical guarantee with generalization error bound. Our approach expands the sample space by re-scaling previous distribution and then estimates the emerging label value via scaling constraint factor. For demonstration, we present two special cases within the framework, together with their optimizations and convergence analyses. Besides evaluating LDCL on most of the existing 13 data sets, we also apply it in the application of emotion distribution recognition. Experimental results demonstrate the effectiveness of our approach in both tackling label ambiguity problem and estimating facial emotion
Can Reinforcement Learning Find Stackelberg-Nash Equilibria in General-Sum Markov Games with Myopically Rational Followers?
http://jmlr.org/papers/v24/22-0203.html
http://jmlr.org/papers/volume24/22-0203/22-0203.pdf
2023Han Zhong, Zhuoran Yang, Zhaoran Wang, Michael I. Jordan
We study multi-player general-sum Markov games with one of the players designated as the leader and the other players regarded as followers. In particular, we focus on the class of games where the followers are myopically rational; i.e., they aim to maximize their instantaneous rewards. For such a game, our goal is to find a Stackelberg-Nash equilibrium (SNE), which is a policy pair $(\pi^*, \nu^*)$ such that: (i) $\pi^*$ is the optimal policy for the leader when the followers always play their best response, and (ii) $\nu^*$ is the best response policy of the followers, which is a Nash equilibrium of the followers' game induced by $\pi^*$. We develop sample-efficient reinforcement learning (RL) algorithms for solving for an SNE in both online and offline settings. Our algorithms are optimistic and pessimistic variants of least-squares value iteration, and they are readily able to incorporate function approximation tools in the setting of large state spaces. Furthermore, for the case with linear function approximation, we prove that our algorithms achieve sublinear regret and suboptimality under online and offline setups respectively.
To the best of our knowledge, we establish the first provably efficient RL algorithms for solving for SNEs in general-sum Markov games with myopically rational followers.
Quantus: An Explainable AI Toolkit for Responsible Evaluation of Neural Network Explanations and Beyond
http://jmlr.org/papers/v24/22-0142.html
http://jmlr.org/papers/volume24/22-0142/22-0142.pdf
2023Anna Hedström, Leander Weber, Daniel Krakowczyk, Dilyara Bareeva, Franz Motzkus, Wojciech Samek, Sebastian Lapuschkin, Marina M.-C. Höhne
The evaluation of explanation methods is a research topic that has not yet been explored deeply, however, since explainability is supposed to strengthen trust in artificial intelligence, it is necessary to systematically review and compare explanation methods in order to confirm their correctness. Until now, no tool with focus on XAI evaluation exists that exhaustively and speedily allows researchers to evaluate the performance of explanations of neural network predictions. To increase transparency and reproducibility in the field, we therefore built Quantus—a comprehensive, evaluation toolkit in Python that includes a growing, well-organised collection of evaluation metrics and tutorials for evaluating explainable methods. The toolkit has been thoroughly tested and is available under an open-source license on PyPi (or on https://github.com/understandable-machine-intelligence-lab/Quantus/).
Gap Minimization for Knowledge Sharing and Transfer
http://jmlr.org/papers/v24/22-0099.html
http://jmlr.org/papers/volume24/22-0099/22-0099.pdf
2023Boyu Wang, Jorge A. Mendez, Changjian Shui, Fan Zhou, Di Wu, Gezheng Xu, Christian Gagné, Eric Eaton
Learning from multiple related tasks by knowledge sharing and transfer has become increasingly relevant over the last two decades. In order to successfully transfer information from one task to another, it is critical to understand the similarities and differences between the domains. In this paper, we introduce the notion of performance gap, an intuitive and novel measure of the distance between learning tasks. Unlike existing measures which are used as tools to bound the difference of expected risks between tasks (e.g., $\mathcal{H}$-divergence or discrepancy distance), we theoretically show that the performance gap can be viewed as a data- and algorithm-dependent regularizer, which controls the model complexity and leads to finer guarantees. More importantly, it also provides new insights and motivates a novel principle for designing strategies for knowledge sharing and transfer: gap minimization. We instantiate this principle with two algorithms: 1. gapBoost, a novel and principled boosting algorithm that explicitly minimizes the performance gap between source and target domains for transfer learning; and 2. gapMTNN, a representation learning algorithm that reformulates gap minimization as semantic conditional matching for multitask learning. Our extensive evaluation on both transfer learning and multitask learning benchmark data sets shows that our methods outperform existing baselines.
Sparse PCA: a Geometric Approach
http://jmlr.org/papers/v24/22-0088.html
http://jmlr.org/papers/volume24/22-0088/22-0088.pdf
2023Dimitris Bertsimas, Driss Lahlou Kitane
We consider the problem of maximizing the variance explained from a data matrix using orthogonal sparse principal components that have a support of fixed cardinality. While most existing methods focus on building principal components (PCs) iteratively through deflation, we propose GeoSPCA, a novel algorithm to build all PCs at once while satisfying the orthogonality constraints which brings substantial benefits over deflation. This novel approach is based on the left eigenvalues of the covariance matrix which helps circumvent the non-convexity of the problem by approximating the optimal solution using a binary linear optimization problem that can find the optimal solution. The resulting approximation can be used to tackle different versions of the sparse PCA problem including the case in which the principal components share the same support or have disjoint supports and the Structured Sparse PCA problem. We also propose optimality bounds and illustrate the benefits of GeoSPCA in selected real world problems both in terms of explained variance, sparsity and tractability. Improvements vs. the greedy algorithm, which is often at par with state-of-the-art techniques, reaches up to 24% in terms of variance while solving real world problems with 10,000s of variables and support cardinality of 100s in minutes. We also apply GeoSPCA in a face recognition problem yielding more than 10% improvement vs. other PCA based technique such as structured sparse PCA.
Labels, Information, and Computation: Efficient Learning Using Sufficient Labels
http://jmlr.org/papers/v24/22-0019.html
http://jmlr.org/papers/volume24/22-0019/22-0019.pdf
2023Shiyu Duan, Spencer Chang, Jose C. Principe
In supervised learning, obtaining a large set of fully-labeled training data is expensive. We show that we do not always need full label information on every single training example to train a competent classifier. Specifically, inspired by the principle of sufficiency in statistics, we present a statistic (a summary) of the fully-labeled training set that captures almost all the relevant information for classification but at the same time is easier to obtain directly. We call this statistic "sufficiently-labeled data" and prove its sufficiency and efficiency for finding the optimal hidden representations, on which competent classifier heads can be trained using as few as a single randomly-chosen fully-labeled example per class. Sufficiently-labeled data can be obtained from annotators directly without collecting the fully-labeled data first. And we prove that it is easier to directly obtain sufficiently-labeled data than obtaining fully-labeled data. Furthermore, sufficiently-labeled data is naturally more secure since it stores relative, instead of absolute, information. Extensive experimental results are provided to support our theory.
Attacks against Federated Learning Defense Systems and their Mitigation
http://jmlr.org/papers/v24/22-0014.html
http://jmlr.org/papers/volume24/22-0014/22-0014.pdf
2023Cody Lewis, Vijay Varadharajan, Nasimul Noman
The susceptibility of federated learning (FL) to attacks from untrustworthy endpoints has led to the design of several defense systems. FL defense systems enhance the federated optimization algorithm using anomaly detection, scaling the updates from endpoints depending on their anomalous behavior. However, the defense systems themselves may be exploited by the endpoints with more sophisticated attacks. First, this paper proposes three categories of attacks and shows that they can effectively deceive some well-known FL defense systems. In the first two categories, referred to as on-off attacks, the adversary toggles between being honest and engaging in attacks. We analyse two such on-off attacks, label flipping and free riding, and show their impact against existing FL defense systems. As a third category, we propose attacks based on “good mouthing” and “bad mouthing”, to boost or diminish influence of the victim endpoints on the global model. Secondly, we propose a new federated optimization algorithm, Viceroy, that can successfully mitigate all the proposed attacks. The proposed attacks and the mitigation strategy have been tested on a number of different experiments establishing their effectiveness in comparison with other contemporary methods. The proposed algorithm has also been made available as open source. Finally, in the appendices, we provide an induction proof for the on-off model poisoning attack, and the proof of convergence and adversarial tolerance for the new federated optimization algorithm.
HiClass: a Python Library for Local Hierarchical Classification Compatible with Scikit-learn
http://jmlr.org/papers/v24/21-1518.html
http://jmlr.org/papers/volume24/21-1518/21-1518.pdf
2023Fábio M. Miranda, Niklas Köhnecke, Bernhard Y. Renard
HiClass is an open-source Python library for local hierarchical classification entirely compatible with scikit-learn. It contains implementations of the most common design patterns for hierarchical machine learning models found in the literature, that is, the local classifiers per node, per parent node and per level. Additionally, the package contains implementations of hierarchical metrics, which are more appropriate for evaluating classification performance on hierarchical data. The documentation includes installation and usage instructions, examples within tutorials and interactive notebooks, and a complete description of the API. HiClass is released under the simplified BSD license, encouraging its use in both academic and commercial environments. Source code and documentation are available at https://github.com/scikit-learn-contrib/hiclass.
Impact of classification difficulty on the weight matrices spectra in Deep Learning and application to early-stopping
http://jmlr.org/papers/v24/21-1441.html
http://jmlr.org/papers/volume24/21-1441/21-1441.pdf
2023XuranMeng, JeffYao
Much recent research effort has been devoted to explain the success of deep learning. Random Matrix Theory (RMT) provides an emerging way to this end by analyzing the spectra of large random matrices involved in a trained deep neural network (DNN) such as weight matrices or Hessian matrices in the stochastic gradient descent algorithm. To better understand spectra of weight matrices, we conduct extensive experiments on weight matrices under different settings for layers, networks and data sets. Based on the previous work of {martin2018implicit}, spectra of weight matrices at the terminal stage of training are classified into three main types: Light Tail (LT), Bulk Transition period (BT) and Heavy Tail (HT). These different types, especially HT, implicitly indicate some regularization in the DNNs. In this paper, inspired from {martin2018implicit}, we identify the difficulty of the classification problem as an important factor for the appearance of HT in weight matrices spectra. Higher the classification difficulty, higher the chance for HT to appear. Moreover, the classification difficulty can be affected either by the signal-to-noise ratio of the dataset, or by the complexity of the classification problem (complex features, large number of classes) as well. Leveraging on this finding, we further propose a spectral criterion to detect the appearance of HT and use it to early stop the training process without testing data. Such early stopped DNNs have the merit of avoiding overfitting and unnecessary extra training while preserving a much comparable generalization ability. These findings from the paper are validated in several NNs (LeNet, MiniAlexNet and VGG), using Gaussian synthetic data and real data sets (MNIST and CIFAR10).
The SKIM-FA Kernel: High-Dimensional Variable Selection and Nonlinear Interaction Discovery in Linear Time
http://jmlr.org/papers/v24/21-1403.html
http://jmlr.org/papers/volume24/21-1403/21-1403.pdf
2023Raj Agrawal, Tamara Broderick
Many scientific problems require identifying a small set of covariates that are associated with a target response and estimating their effects. Often, these effects are nonlinear and include interactions, so linear and additive methods can lead to poor estimation and variable selection. Unfortunately, methods that simultaneously express sparsity, nonlinearity, and interactions are computationally intractable --- with runtime at least quadratic in the number of covariates, and often worse. In the present work, we solve this computational bottleneck. We show that suitable interaction models have a kernel representation, namely there exists a "kernel trick" to perform variable selection and estimation in $O$(# covariates) time. Our resulting fit corresponds to a sparse orthogonal decomposition of the regression function in a Hilbert space (i.e., a functional ANOVA decomposition), where interaction effects represent all variation that cannot be explained by lower-order effects. On a variety of synthetic and real data sets, our approach outperforms existing methods used for large, high-dimensional data sets while remaining competitive (or being orders of magnitude faster) in runtime.
Generalization Bounds for Noisy Iterative Algorithms Using Properties of Additive Noise Channels
http://jmlr.org/papers/v24/21-1396.html
http://jmlr.org/papers/volume24/21-1396/21-1396.pdf
2023Hao Wang, Rui Gao, Flavio P. Calmon
Machine learning models trained by different optimization algorithms under different data distributions can exhibit distinct generalization behaviors. In this paper, we analyze the generalization of models trained by noisy iterative algorithms. We derive distribution-dependent generalization bounds by connecting noisy iterative algorithms to additive noise channels found in communication and information theory. Our generalization bounds shed light on several applications, including differentially private stochastic gradient descent (DP-SGD), federated learning, and stochastic gradient Langevin dynamics (SGLD). We demonstrate our bounds through numerical experiments, showing that they can help understand recent empirical observations of the generalization phenomena of neural networks.
Discrete Variational Calculus for Accelerated Optimization
http://jmlr.org/papers/v24/21-1323.html
http://jmlr.org/papers/volume24/21-1323/21-1323.pdf
2023Cédric M. Campos, Alejandro Mahillo, David Martín de Diego
Many of the new developments in machine learning are connected with gradient-based optimization methods. Recently, these methods have been studied using a variational perspective (Betancourt et al., 2018). This has opened up the possibility of introducing variational and symplectic methods using geometric integration. In particular, in this paper, we introduce variational integrators (Marsden and West, 2001) which allow us to derive different methods for optimization. Using both Hamilton’s and Lagrange-d’Alembert’s principle, we derive two families of optimization methods in one-to-one correspondence that generalize Polyak’s heavy ball (Polyak, 1964) and Nesterov’s accelerated gradient (Nesterov, 1983), the second of which mimics the behavior of the latter reducing the oscillations of classical momentum methods. However, since the systems considered are explicitly time-dependent, the preservation of symplecticity of autonomous systems occurs here solely on the fibers. Several experiments exemplify the result.
Calibrated Multiple-Output Quantile Regression with Representation Learning
http://jmlr.org/papers/v24/21-1280.html
http://jmlr.org/papers/volume24/21-1280/21-1280.pdf
2023Shai Feldman, Stephen Bates, Yaniv Romano
We develop a method to generate predictive regions that cover a multivariate response variable with a user-specified probability. Our work is composed of two components. First, we use a deep generative model to learn a representation of the response that has a unimodal distribution. Existing multiple-output quantile regression approaches are effective in such cases, so we apply them on the learned representation, and then transform the solution to the original space of the response. This process results in a flexible and informative region that can have an arbitrary shape, a property that existing methods lack. Second, we propose an extension of conformal prediction to the multivariate response setting that modifies any method to return sets with a pre-specified coverage level. The desired coverage is theoretically guaranteed in the finite-sample case for any distribution. Experiments conducted on both real and synthetic data show that our method constructs regions that are significantly smaller compared to existing techniques.
Bayesian Data Selection
http://jmlr.org/papers/v24/21-1067.html
http://jmlr.org/papers/volume24/21-1067/21-1067.pdf
2023Eli N. Weinstein, Jeffrey W. Miller
Insights into complex, high-dimensional data can be obtained by discovering features of the data that match or do not match a model of interest. To formalize this task, we introduce the "data selection" problem: finding a lower-dimensional statistic - such as a subset of variables - that is well fit by a given parametric model of interest. A fully Bayesian approach to data selection would be to parametrically model the value of the statistic, nonparametrically model the remaining "background" components of the data, and perform standard Bayesian model selection for the choice of statistic. However, fitting a nonparametric model to high-dimensional data tends to be highly inefficient, statistically and computationally. We propose a novel score for performing data selection, the "Stein volume criterion (SVC)", that does not require fitting a nonparametric model. The SVC takes the form of a generalized marginal likelihood with a kernelized Stein discrepancy in place of the Kullback-Leibler divergence. We prove that the SVC is consistent for data selection, and establish consistency and asymptotic normality of the corresponding generalized posterior on parameters. We apply the SVC to the analysis of single-cell RNA sequencing data sets using probabilistic principal components analysis and a spin glass model of gene regulation.
Lower Bounds and Accelerated Algorithms for Bilevel Optimization
http://jmlr.org/papers/v24/21-0949.html
http://jmlr.org/papers/volume24/21-0949/21-0949.pdf
2023Kaiyi ji, Yingbin Liang
Bilevel optimization has recently attracted growing interests due to its wide applications in modern machine learning problems. Although recent studies have characterized the convergence rate for several such popular algorithms, it is still unclear how much further these convergence rates can be improved. In this paper, we address this fundamental question from two perspectives. First, we provide the first-known lower complexity bounds of $\widetilde \Omega\bigg(\sqrt{\frac{L_y\widetilde L_{xy}^2}{\mu_x\mu_y^2}}\bigg)$ and $\widetilde \Omega\big(\frac{1}{\sqrt{\epsilon}}\min\{\kappa_y,\frac{1}{\sqrt{\epsilon^{3}}}\}\big)$ respectively for strongly-convex-strongly-convex and convex-strongly-convex bilevel optimizations. Second, we propose an accelerated bilevel optimizer named AccBiO, for which we provide the first-known complexity bounds without the gradient boundedness assumption (which was made in existing analyses) under the two aforementioned geometries. We also provide significantly tighter upper bounds than the existing complexity when the bounded gradient assumption does hold. We show that AccBiO achieves the optimal results (i.e., the upper and lower bounds match up to logarithmic factors) when the inner-level problem takes a quadratic form with a constant-level condition number. Interestingly, our lower bounds under both geometries are larger than the corresponding optimal complexities of minimax optimization, establishing that bilevel optimization is provably more challenging than minimax optimization. Our theoretical results are validated by numerical experiments.
Graph-Aided Online Multi-Kernel Learning
http://jmlr.org/papers/v24/21-0877.html
http://jmlr.org/papers/volume24/21-0877/21-0877.pdf
2023Pouya M. Ghari, Yanning Shen
Multi-kernel learning (MKL) has been widely used in learning problems involving function learning tasks. Compared with single kernel learning approach which relies on a pre-selected kernel, the advantage of MKL is its flexibility results from combining a dictionary of kernels. However, inclusion of irrelevant kernels in the dictionary may deteriorate the accuracy of MKL, and increase the computational complexity. Faced with this challenge, a novel graph-aided framework is developed to select a subset of kernels from the dictionary with the assistance of a graph. Different graph construction and refinement schemes are developed based on incurred losses or kernel similarities to assist the adaptive selection process. Moreover, to cope with the scenario where data may be collected in a sequential fashion, or cannot be stored in batch due to the massive scale, random feature approximation are adopted to enable online function learning. It is proved that our proposed algorithms enjoy sub-linear regret bounds. Experiments on a number of real datasets showcase the advantages of our novel graph-aided algorithms compared to state-of-the-art alternatives.
Interpolating Classifiers Make Few Mistakes
http://jmlr.org/papers/v24/21-0844.html
http://jmlr.org/papers/volume24/21-0844/21-0844.pdf
2023Tengyuan Liang, Benjamin Recht
This paper provides elementary analyses of the regret and generalization of minimum-norm interpolating classifiers (MNIC). The MNIC is the function of smallest Reproducing Kernel Hilbert Space norm that perfectly interpolates a label pattern on a finite data set. We derive a mistake bound for MNIC and a regularized variant that holds for all data sets. This bound follows from elementary properties of matrix inverses. Under the assumption that the data is independently and identically distributed, the mistake bound implies that MNIC generalizes at a rate proportional to the norm of the interpolating solution and inversely proportional to the number of data points. This rate matches similar rates derived for margin classifiers and perceptrons. We derive several plausible generative models where the norm of the interpolating classifier is bounded or grows at a rate sublinear in $n$. We also show that as long as the population class conditional distributions are sufficiently separable in total variation, then MNIC generalizes with a fast rate.
Regularized Joint Mixture Models
http://jmlr.org/papers/v24/21-0796.html
http://jmlr.org/papers/volume24/21-0796/21-0796.pdf
2023Konstantinos Perrakis, Thomas Lartigue, Frank Dondelinger, Sach Mukherjee
Regularized regression models are well studied and, under appropriate conditions, offer fast and statistically interpretable results. However, large data in many applications are heterogeneous in the sense of harboring distributional differences between latent groups. Then, the assumption that the conditional distribution of response $Y$ given features $X$ is the same for all samples may not hold. Furthermore, in scientific applications, the covariance structure of the features may contain important signals and its learning is also affected by latent group structure. We propose a class of mixture models for paired data $(X,Y)$ that couples together the distribution of $X$ (using sparse graphical models) and the conditional $Y \! \mid \! X$ (using sparse regression models). The regression and graphical models are specific to the latent groups and model parameters are estimated jointly. This allows signals in either or both of the feature distribution and regression model to inform learning of latent structure and provides automatic control of confounding by such structure. Estimation is handled via an expectation-maximization algorithm, whose convergence is established theoretically. We illustrate the key ideas via empirical examples. An R package is available at https://github.com/k-perrakis/regjmix.
An Inertial Block Majorization Minimization Framework for Nonsmooth Nonconvex Optimization
http://jmlr.org/papers/v24/21-0571.html
http://jmlr.org/papers/volume24/21-0571/21-0571.pdf
2023Le Thi Khanh Hien, Duy Nhat Phan, Nicolas Gillis
In this paper, we introduce TITAN, a novel inerTIal block majorizaTion minimizAtioN ramework for nonsmooth nonconvex optimization problems. To the best of our knowledge, TITAN is the first framework of block-coordinate update method that relies on the majorization-minimization framework while embedding inertial force to each step of the block updates. The inertial force is obtained via an extrapolation operator that subsumes heavy-ball and Nesterov-type accelerations for block proximal gradient methods as special cases. By choosing various surrogate functions, such as proximal, Lipschitz gradient, Bregman, quadratic, and composite surrogate functions, and by varying the extrapolation operator, TITAN produces a rich set of inertial block-coordinate update methods. We study sub-sequential convergence as well as global convergence for the generated sequence of TITAN. We illustrate the effectiveness of TITAN on two important machine learning problems, namely sparse non-negative matrix factorization and matrix completion.
Learning Mean-Field Games with Discounted and Average Costs
http://jmlr.org/papers/v24/21-0505.html
http://jmlr.org/papers/volume24/21-0505/21-0505.pdf
2023Berkay Anahtarci, Can Deha Kariksiz, Naci Saldi
We consider learning approximate Nash equilibria for discrete-time mean-field games with stochastic nonlinear state dynamics subject to both average and discounted costs. To this end, we introduce a mean-field equilibrium (MFE) operator, whose fixed point is a mean-field equilibrium, i.e., equilibrium in the infinite population limit. We first prove that this operator is a contraction, and propose a learning algorithm to compute an approximate mean-field equilibrium by approximating the MFE operator with a random one. Moreover, using the contraction property of the MFE operator, we establish the error analysis of the proposed learning algorithm. We then show that the learned mean-field equilibrium constitutes an approximate Nash equilibrium for finite-agent games.
Globally-Consistent Rule-Based Summary-Explanations for Machine Learning Models: Application to Credit-Risk Evaluation
http://jmlr.org/papers/v24/21-0488.html
http://jmlr.org/papers/volume24/21-0488/21-0488.pdf
2023Cynthia Rudin, Yaron Shaposhnik
We develop a method for understanding specific predictions made by (global) predictive models by constructing (local) models tailored to each specific observation (these are also called “explanations" in the literature). Unlike existing work that “explains” specific observations by approximating global models in the vicinity of these observations, we fit models that are globally-consistent with predictions made by the global model on past data. We focus on rule-based models (also known as association rules or conjunctions of predicates), which are interpretable and widely used in practice. We design multiple algorithms to extract such rules from discrete and continuous datasets, and study their theoretical properties. Finally, we apply these algorithms to multiple credit-risk models trained on the Explainable Machine Learning Challenge data from FICO and demonstrate that our approach effectively produces sparse summary-explanations of these models in seconds. Our approach is model-agnostic (that is, can be used to explain any predictive model), and solves a minimum set cover problem to construct its summaries.
Extending Adversarial Attacks to Produce Adversarial Class Probability Distributions
http://jmlr.org/papers/v24/21-0326.html
http://jmlr.org/papers/volume24/21-0326/21-0326.pdf
2023Jon Vadillo, Roberto Santana, Jose A. Lozano
Despite the remarkable performance and generalization levels of deep learning models in a wide range of artificial intelligence tasks, it has been demonstrated that these models can be easily fooled by the addition of imperceptible yet malicious perturbations to natural inputs. These altered inputs are known in the literature as adversarial examples. In this paper, we propose a novel probabilistic framework to generalize and extend adversarial attacks in order to produce a desired probability distribution for the classes when we apply the attack method to a large number of inputs. This novel attack paradigm provides the adversary with greater control over the target model, thereby exposing, in a wide range of scenarios, threats against deep learning models that cannot be conducted by the conventional paradigms. We introduce four different strategies to efficiently generate such attacks, and illustrate our approach by extending multiple adversarial attack algorithms. We also experimentally validate our approach for the spoken command classification task and the Tweet emotion classification task, two exemplary machine learning problems in the audio and text domain, respectively. Our results demonstrate that we can closely approximate any probability distribution for the classes while maintaining a high fooling rate and even prevent the attacks from being detected by label-shift detection methods.
Python package for causal discovery based on LiNGAM
http://jmlr.org/papers/v24/21-0321.html
http://jmlr.org/papers/volume24/21-0321/21-0321.pdf
2023Takashi Ikeuchi, Mayumi Ide, Yan Zeng, Takashi Nicholas Maeda, Shohei Shimizu
Causal discovery is a methodology for learning causal graphs from data, and LiNGAM is a well-known model for causal discovery. This paper describes an open-source Python package for causal discovery based on LiNGAM. The package implements various LiNGAM methods under different settings like time series cases, multiple-group cases, mixed data cases, and hidden common cause cases, in addition to evaluation of statistical reliability and model assumptions. The source code is freely available under the MIT license at https://github.com/cdt15/lingam.
Adaptation to the Range in K-Armed Bandits
http://jmlr.org/papers/v24/21-0148.html
http://jmlr.org/papers/volume24/21-0148/21-0148.pdf
2023Hédi Hadiji, Gilles Stoltz
We consider stochastic bandit problems with $K$ arms, each associated with a distribution supported on a given finite range $[m,M]$. We do not assume that the range $[m,M]$ is known and show that there is a cost for learning this range. Indeed, a new trade-off between distribution-dependent and distribution-free regret bounds arises, which prevents from simultaneously achieving the typical $\ln T$ and $\sqrt{T}$ bounds. For instance, a $\sqrt{T}$ distribution-free regret bound may only be achieved if the distribution-dependent regret bounds are at least of order $\sqrt{T}$. We exhibit a strategy achieving the rates for regret imposed by the new trade-off.
Learning-augmented count-min sketches via Bayesian nonparametrics
http://jmlr.org/papers/v24/21-0096.html
http://jmlr.org/papers/volume24/21-0096/21-0096.pdf
2023Emanuele Dolera, Stefano Favaro, Stefano Peluchetti
The count-min sketch (CMS) is a time and memory efficient randomized data structure that provides estimates of tokens' frequencies in a data stream of tokens, i.e. point queries, based on random hashed data. A learning-augmented version of the CMS, referred to as CMS-DP, has been proposed by Cai, Mitzenmacher and Adams (NeurIPS 2018), and it relies on Bayesian nonparametric (BNP) modeling of the data stream of tokens via a Dirichlet process (DP) prior, with estimates of a point query being that are obtained as mean functionals of the posterior distribution of the point query, given the hashed data. While the CMS-DP has proved to improve on some aspects of CMS, it has the major drawback of arising from a “constructive" proof that builds upon arguments that are tailored to the DP prior, namely arguments that are not usable for other nonparametric priors. In this paper, we present a “Bayesian" proof of the CMS-DP that has the main advantage of building upon arguments that are usable under the popular Pitman-Yor process (PYP) prior, which generalizes the DP prior by allowing for a more flexible tail behaviour, ranging from geometric tails to heavy power-law tails. This result leads to develop a novel learning-augmented CMS under power-law data streams, referred to as CMS-PYP, which relies on BNP modeling of the data stream of tokens via a PYP prior. Under this more general framework, we apply the arguments of the “Bayesian" proof of the CMS-DP, suitably adapted to the PYP prior, in order to compute the posterior distribution of a point query, given the hashed data. Applications to synthetic data and real textual data show that the CMS-PYP outperforms the CMS and the CMS-DP in estimating low-frequency tokens, which are known to be of critical interest in textual data, and it is competitive with respect to a variation of the CMS designed to deal with the estimation of low-frequency tokens. An extension of our BNP approach to more general queries, such as range queries, is also discussed.
Optimal Strategies for Reject Option Classifiers
http://jmlr.org/papers/v24/21-0048.html
http://jmlr.org/papers/volume24/21-0048/21-0048.pdf
2023Vojtech Franc, Daniel Prusa, Vaclav Voracek
In classification with a reject option, the classifier is allowed in uncertain cases to abstain from prediction. The classical cost-based model of a reject option classifier requires the rejection cost to be defined explicitly. The alternative bounded-improvement model and the bounded-abstention model avoid the notion of the reject cost. The bounded-improvement model seeks a classifier with a guaranteed selective risk and maximal cover. The bounded-abstention model seeks a classifier with guaranteed cover and minimal selective risk. We prove that despite their different formulations the three rejection models lead to the same prediction strategy: the Bayes classifier endowed with a randomized Bayes selection function. We define the notion of a proper uncertainty score as a scalar summary of the prediction uncertainty sufficient to construct the randomized Bayes selection function. We propose two algorithms to learn the proper uncertainty score from examples for an arbitrary black-box classifier. We prove that both algorithms provide Fisher consistent estimates of the proper uncertainty score and demonstrate their efficiency in different prediction problems, including classification, ordinal regression, and structured output classification.
A Line-Search Descent Algorithm for Strict Saddle Functions with Complexity Guarantees
http://jmlr.org/papers/v24/20-608.html
http://jmlr.org/papers/volume24/20-608/20-608.pdf
2023Michael J. O'Neill, Stephen J. Wright
We describe a line-search algorithm which achieves the best-known worst-case complexity results for problems with a certain “strict saddle” property that has been observed to hold in low-rank matrix optimization problems. Our algorithm is adaptive, in the sense that it makes use of backtracking line searches and does not require prior knowledge of the parameters that define the strict saddle property.
Sampling random graph homomorphisms and applications to network data analysis
http://jmlr.org/papers/v24/20-449.html
http://jmlr.org/papers/volume24/20-449/20-449.pdf
2023Hanbaek Lyu, Facundo Memoli, David Sivakoff
A graph homomorphism is a map between two graphs that preserves adjacency relations. We consider the problem of sampling a random graph homomorphism from a graph into a large network. We propose two complementary MCMC algorithms for sampling random graph homomorphisms and establish bounds on their mixing times and the concentration of their time averages. Based on our sampling algorithms, we propose a novel framework for network data analysis that circumvents some of the drawbacks in methods based on independent and neighborhood sampling. Various time averages of the MCMC trajectory give us various computable observables, including well-known ones such as homomorphism density and average clustering coefficient and their generalizations. Furthermore, we show that these network observables are stable with respect to a suitably renormalized cut distance between networks. We provide various examples and simulations demonstrating our framework through synthetic networks. We also \commHL{demonstrate the performance of} our framework on the tasks of network clustering and subgraph classification on the Facebook100 dataset and on Word Adjacency Networks of a set of classic novels.
A Relaxed Inertial Forward-Backward-Forward Algorithm for Solving Monotone Inclusions with Application to GANs
http://jmlr.org/papers/v24/20-267.html
http://jmlr.org/papers/volume24/20-267/20-267.pdf
2023Radu I. Bot, Michael Sedlmayer, Phan Tu Vuong
We introduce a relaxed inertial forward-backward-forward (RIFBF) splitting algorithm for approaching the set of zeros of the sum of a maximally monotone operator and a single-valued monotone and Lipschitz continuous operator. This work aims to extend Tseng's forward-backward-forward method by both using inertial effects as well as relaxation parameters. We formulate first a second order dynamical system that approaches the solution set of the monotone inclusion problem to be solved and provide an asymptotic analysis for its trajectories. We provide for RIFBF, which follows by explicit time discretization, a convergence analysis in the general monotone case as well as when applied to the solving of pseudo-monotone variational inequalities. We illustrate the proposed method by applications to a bilinear saddle point problem, in the context of which we also emphasize the interplay between the inertial and the relaxation parameters, and to the training of Generative Adversarial Networks (GANs).
On Distance and Kernel Measures of Conditional Dependence
http://jmlr.org/papers/v24/20-238.html
http://jmlr.org/papers/volume24/20-238/20-238.pdf
2023Tianhong Sheng, Bharath K. Sriperumbudur
Measuring conditional dependence is one of the important tasks in statistical inference and is fundamental in causal discovery, feature selection, dimensionality reduction, Bayesian network learning, and others. In this work, we explore the connection between conditional dependence measures induced by distances on a metric space and reproducing kernels associated with a reproducing kernel Hilbert space (RKHS). For certain distance and kernel pairs, we show the distance-based conditional dependence measures to be equivalent to that of kernel-based measures. On the other hand, we also show that some popular kernel conditional dependence measures based on the Hilbert-Schmidt norm of a certain cross-conditional covariance operator, do not have a simple distance representation, except in some limiting cases.
AutoKeras: An AutoML Library for Deep Learning
http://jmlr.org/papers/v24/20-1355.html
http://jmlr.org/papers/volume24/20-1355/20-1355.pdf
2023Haifeng Jin, François Chollet, Qingquan Song, Xia Hu
To use deep learning, one needs to be familiar with various software tools like TensorFlow or Keras, as well as various model architecture and optimization best practices. Despite recent progress in software usability, deep learning remains a highly specialized occupation. To enable people with limited machine learning and programming experience to adopt deep learning, we developed AutoKeras, an Automated Machine Learning (AutoML) library that automates the process of model selection and hyperparameter tuning. AutoKeras encapsulates the complex process of building and training deep neural networks into a very simple and accessible interface, which enables novice users to solve standard machine learning problems with a few lines of code. Designed with practical applications in mind, AutoKeras is built on top of Keras and TensorFlow, and all AutoKeras-created models can be easily exported and deployed with the help of the TensorFlow ecosystem tooling.
Cluster-Specific Predictions with Multi-Task Gaussian Processes
http://jmlr.org/papers/v24/20-1321.html
http://jmlr.org/papers/volume24/20-1321/20-1321.pdf
2023Arthur Leroy, Pierre Latouche, Benjamin Guedj, Servane Gey
A model involving Gaussian processes (GPs) is introduced to simultaneously handle multitask learning, clustering, and prediction for multiple functional data. This procedure acts as a model-based clustering method for functional data as well as a learning step for subsequent predictions for new tasks. The model is instantiated as a mixture of multi-task GPs with common mean processes. A variational EM algorithm is derived for dealing with the optimisation of the hyper-parameters along with the hyper-posteriors’ estimation of latent variables and processes. We establish explicit formulas for integrating the mean processes and the latent clustering variables within a predictive distribution, accounting for uncertainty in both aspects. This distribution is defined as a mixture of cluster-specific GP predictions, which enhances the performance when dealing with group-structured data. The model handles irregular grids of observations and offers different hypotheses on the covariance structure for sharing additional information across tasks. The performances on both clustering and prediction tasks are assessed through various simulated scenarios and real data sets. The overall algorithm, called MagmaClust, is publicly available as an R package.
Efficient Structure-preserving Support Tensor Train Machine
http://jmlr.org/papers/v24/20-1310.html
http://jmlr.org/papers/volume24/20-1310/20-1310.pdf
2023Kirandeep Kour, Sergey Dolgov, Martin Stoll, Peter Benner
An increasing amount of the collected data are high-dimensional multi-way arrays (tensors), and it is crucial for efficient learning algorithms to exploit this tensorial structure as much as possible. The ever present curse of dimensionality for high dimensional data and the loss of structure when vectorizing the data motivates the use of tailored low-rank tensor classification methods. In the presence of small amounts of training data, kernel methods offer an attractive choice as they provide the possibility for a nonlinear decision boundary. We develop the Tensor Train Multi-way Multi-level Kernel (TT-MMK), which combines the simplicity of the Canonical Polyadic decomposition, the classification power of the Dual Structure-preserving Support Vector Machine, and the reliability of the Tensor Train (TT) approximation. We show by experiments that the TT-MMK method is usually more reliable computationally, less sensitive to tuning parameters, and gives higher prediction accuracy in the SVM classification when benchmarked against other state-of-the-art techniques.
Bayesian Spiked Laplacian Graphs
http://jmlr.org/papers/v24/20-1206.html
http://jmlr.org/papers/volume24/20-1206/20-1206.pdf
2023Leo L Duan, George Michailidis, Mingzhou Ding
In network analysis, it is common to work with a collection of graphs that exhibit heterogeneity. For example, neuroimaging data from patient cohorts are increasingly available. A critical analytical task is to identify communities, and graph Laplacian-based methods are routinely used. However, these methods are currently limited to a single network and also do not provide measures of uncertainty on the community assignment. In this work, we first propose a probabilistic network model called the ”Spiked Laplacian Graph” that considers an observed network as a transform of the Laplacian and degree matrices of the network generating process, with the Laplacian eigenvalues modeled by a modified spiked structure. This effectively reduces the number of parameters in the eigenvectors, and their sign patterns allow efficient estimation of the underlying community structure. Further, the posterior distribution of the eigenvectors provides uncertainty quantification for the community estimates. Second, we introduce a Bayesian non-parametric approach to address the issue of heterogeneity in a collection of graphs. Theoretical results are established on the posterior consistency of the procedure and provide insights on the trade-off between model resolution and accuracy. We illustrate the performance of the methodology on synthetic data sets, as well as a neuroscience study related to brain activity in working memory.
The Brier Score under Administrative Censoring: Problems and a Solution
http://jmlr.org/papers/v24/19-1030.html
http://jmlr.org/papers/volume24/19-1030/19-1030.pdf
2023Håvard Kvamme, Ørnulf Borgan
The Brier score is commonly used for evaluating probability predictions. In survival analysis, with right-censored observations of the event times, this score can be weighted by the inverse probability of censoring (IPCW) to retain its original interpretation. It is common practice to estimate the censoring distribution with the Kaplan-Meier estimator, even though it assumes that the censoring distribution is independent of the covariates. This paper investigates problems that may arise for the IPCW weighting scheme when the covariates used in the prediction model contain information about the censoring times. In particular, this may occur for administratively censored data if the distribution of the covariates varies with calendar time. For administratively censored data, we propose an alternative version of the Brier score. This administrative Brier score does not require estimation of the censoring distribution and is valid also when the censoring times can be predicted from the covariates.
Approximation Bounds for Hierarchical Clustering: Average Linkage, Bisecting K-means, and Local Search
http://jmlr.org/papers/v24/18-080.html
http://jmlr.org/papers/volume24/18-080/18-080.pdf
2023Benjamin Moseley, Joshua R. Wang
Hierarchical clustering is a data analysis method that has been used for decades. Despite its widespread use, the method has an underdeveloped analytical foundation. Having a well-understood foundation would both support the currently used methods and help guide future improvements. The goal of this paper is to give an analytic framework to better understand observations seen in practice. This paper considers the dual of a problem framework for hierarchical clustering introduced by Dasgupta. The main result is that one of the most popular algorithms used in practice, average linkage agglomerative clustering, has a small constant approximation ratio for this objective. To contrast, this paper establishes that using several other popular algorithms, including bisecting $k$-means divisive clustering, have a very poor lower bound on its approximation ratio for the same objective. However, we show that there are divisive algorithms that perform well with respect to this objective by giving two constant approximation algorithms. This paper is some of the first work to establish guarantees on widely used hierarchical algorithms for a natural objective function. This objective and analysis give insight into what these popular algorithms are optimizing and when they will perform well.