## Stochastic Gradient Descent as Approximate Bayesian Inference

*Stephan Mandt, Matthew D. Hoffman, David M. Blei*; 18(134):1–35, 2017.

### Abstract

Stochastic Gradient Descent with a constant learning rate
(constant SGD) simulates a Markov chain with a stationary
distribution. With this perspective, we derive several new
results. (1) We show that constant SGD can be used as an
approximate Bayesian posterior inference algorithm.
Specifically, we show how to adjust the tuning parameters of
constant SGD to best match the stationary distribution to a
posterior, minimizing the Kullback-Leibler divergence between
these two distributions. (2) We demonstrate that constant SGD
gives rise to a new variational EM algorithm that optimizes
hyperparameters in complex probabilistic models. (3) We also
show how to tune SGD with momentum for approximate sampling. (4)
We analyze stochastic-gradient MCMC algorithms. For Stochastic-
Gradient Langevin Dynamics and Stochastic-Gradient Fisher
Scoring, we quantify the approximation errors due to finite
learning rates. (5) Finally, we use the stochastic process
perspective to give a short proof of why Polyak averaging is
optimal. Based on this idea, we propose a scalable approximate
MCMC algorithm, the Averaged Stochastic Gradient Sampler.
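
The opening claim, that constant SGD simulates a Markov chain with a
stationary distribution, can be illustrated with a minimal sketch. It is
not the authors' code, and it assumes a toy setting chosen for this
example: a one-dimensional quadratic loss (estimating the mean of a
dataset) with minibatches drawn with replacement. The function names
`constant_sgd_chain` and `predicted_stationary_variance` are
hypothetical. In this toy model the iterates form an AR(1) process, so
the stationary variance has the closed form
`eps * var(data) / (S * (2 - eps))`, which the simulation can be checked
against.

```python
import numpy as np


def constant_sgd_chain(data, eps, batch_size, n_steps, rng):
    """Run constant-step SGD on L(theta) = mean(0.5 * (theta - x_i)^2).

    Returns the full trajectory of iterates, viewed as a Markov chain.
    """
    # Draw every minibatch up front (sampling with replacement).
    idx = rng.integers(0, len(data), size=(n_steps, batch_size))
    batch_means = data[idx].mean(axis=1)

    theta = np.empty(n_steps)
    current = 0.0
    for t in range(n_steps):
        grad = current - batch_means[t]  # stochastic gradient of the loss
        current = current - eps * grad   # constant learning rate eps
        theta[t] = current
    return theta


def predicted_stationary_variance(data, eps, batch_size):
    """Closed-form stationary variance for this toy AR(1) chain only."""
    return eps * data.var() / (batch_size * (2.0 - eps))


rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=1.0, size=1000)

chain = constant_sgd_chain(data, eps=0.1, batch_size=10,
                           n_steps=200_000, rng=rng)
samples = chain[1000:]  # discard the initial transient

print("data mean:          ", data.mean())
print("chain mean:         ", samples.mean())
print("empirical variance: ", samples.var())
print("predicted variance: ", predicted_stationary_variance(data, 0.1, 10))
```

The iterates do not converge to a point. They fluctuate around the
minimizer with a spread set by the learning rate and the minibatch size,
and these are the tuning parameters the abstract proposes to adjust so
that the stationary distribution matches a posterior. The sketch does
not perform that matching step.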
