Gradient Methods Never Overfit On Separable Data
Ohad Shamir; 22(85):1−20, 2021.
Abstract
A line of recent works established that when training linear predictors over separable data, using gradient methods and exponentially-tailed losses, the predictors asymptotically converge in direction to the max-margin predictor. As a consequence, the predictors asymptotically do not overfit. However, this does not address the question of whether overfitting might occur non-asymptotically, after some bounded number of iterations. In this paper, we formally show that standard gradient methods (in particular, gradient flow, gradient descent and stochastic gradient descent) *never* overfit on separable data: If we run these methods for T iterations on a dataset of size m, both the empirical risk and the generalization error decrease at an essentially optimal rate of Õ(1/(γ²T)) up until T ≈ m, at which point the generalization error remains fixed at an essentially optimal level of Õ(1/(γ²m)) regardless of how large T is. Along the way, we present non-asymptotic bounds on the number of margin violations over the dataset, and prove their tightness.
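The convergence-in-direction phenomenon the abstract refers to can be illustrated with a minimal sketch (not code from the paper): gradient descent with the logistic loss, which has an exponential tail, on a small hypothetical linearly separable dataset. The norm of the predictor w grows over time, while its direction w/||w|| stabilizes and attains a positive normalized margin on every training point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical separable data: labels given by the sign of the first
# coordinate, with points pushed away from the decision boundary x1 = 0
# so the dataset is separable with a positive margin.
X = rng.normal(size=(40, 2))
X[:, 0] += 2.0 * np.sign(X[:, 0])
y = np.sign(X[:, 0])

# Plain gradient descent on the mean logistic loss log(1 + exp(-y * <w, x>)).
w = np.zeros(2)
eta = 0.1
for t in range(5000):
    margins = y * (X @ w)
    # Gradient of the mean logistic loss with respect to w.
    grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= eta * grad

# The direction of w is what converges; its norm keeps growing.
direction = w / np.linalg.norm(w)
min_margin = (y * (X @ direction)).min()  # normalized margin on the data
print("direction:", direction, "min normalized margin:", min_margin)
```

After enough iterations, `min_margin` is strictly positive, i.e. the (normalized) predictor separates the data with a margin, consistent with the asymptotic max-margin results the abstract builds on. The dataset, step size, and iteration count here are illustrative choices, not quantities from the paper.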
© JMLR 2021.