Weighted SGD for $\ell_p$ Regression with Randomized Preconditioning

Jiyan Yang; Yin-Lam Chow; Christopher Ré; Michael W. Mahoney

Jiyan Yang, Yin-Lam Chow, Christopher Ré, Michael W. Mahoney; 18(211):1−43, 2018.

Abstract

In recent years, stochastic gradient descent (SGD) methods and randomized linear algebra (RLA) algorithms have been applied to many large-scale problems in machine learning and data analysis. SGD methods are easy to implement and applicable to a wide range of convex optimization problems. In contrast, RLA algorithms provide much stronger performance guarantees but are applicable to a narrower class of problems. We aim to bridge the gap between these two methods in solving constrained overdetermined linear regression problems---e.g., $\ell_2$ and $\ell_1$ regression problems.

We propose a hybrid algorithm named PWSGD that uses RLA techniques for preconditioning and constructing an importance sampling distribution, and then performs an SGD-like iterative process with weighted sampling on the preconditioned system.
By rewriting a deterministic $\ell_p$ regression problem as a stochastic optimization problem, we connect PWSGD to several existing $\ell_p$ solvers including RLA methods with algorithmic leveraging (RLA for short).
We prove that PWSGD inherits faster convergence rates that depend only on the lower dimension of the linear system, while maintaining low computational complexity. Such SGD convergence rates are superior to those of other related SGD algorithms such as the weighted randomized Kaczmarz algorithm.
In particular, when solving an $\ell_1$ regression problem of size $n$ by $d$, PWSGD returns an approximate solution with $\epsilon$ relative error in the objective value in $\mathcal{O}(\log n \cdot \text{nnz}(A) + \text{poly}(d)/\epsilon^2)$ time. This complexity is uniformly better than that of RLA methods in terms of both $\epsilon$ and $d$ when the problem is unconstrained. In the presence of constraints, PWSGD only has to solve a sequence of much simpler and smaller optimization problems over the same constraints. In general, this is more efficient than solving the constrained subproblem required in RLA.
For $\ell_2$ regression, PWSGD returns an approximate solution with $\epsilon$ relative error in the objective value and in the solution vector measured in prediction norm in $\mathcal{O}(\log n \cdot \text{nnz}(A) + \text{poly}(d) \log(1/\epsilon) /\epsilon)$ time. We show that for unconstrained $\ell_2$ regression, this complexity is comparable to that of RLA and is asymptotically better than that of several state-of-the-art solvers in the regime where the desired accuracy $\epsilon$, high dimension $n$, and low dimension $d$ satisfy $d\geq 1/\epsilon$ and $n \geq d^2/\epsilon$.
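
The three stages described above (preconditioning, importance sampling, weighted SGD on the preconditioned system) can be illustrated with a minimal sketch for the unconstrained $\ell_2$ case. This is not the authors' implementation: the function name `pwsgd_l2_sketch`, the dense Gaussian sketch, the default sketch size, and the constant step size are all assumptions chosen for brevity. In particular, a dense Gaussian sketch does not achieve the $\text{nnz}(A)$-type running time quoted above, and the $\ell_1$ and constrained variants are not covered.

```python
import numpy as np

def pwsgd_l2_sketch(A, b, n_iters=5000, sketch_rows=None, step=None, seed=0):
    """Illustrative preconditioned weighted SGD for min_x ||Ax - b||_2^2.

    Hypothetical helper: names, defaults, and the sketching matrix are
    assumptions made for this example.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    s = sketch_rows if sketch_rows is not None else 10 * d

    # Stage 1 (preconditioning): sketch A, take the R factor of the sketch,
    # and use it as a right preconditioner so that U = A R^{-1} is well
    # conditioned. The dense Gaussian sketch is an assumption for brevity.
    S = rng.standard_normal((s, n)) / np.sqrt(s)
    R = np.linalg.qr(S @ A, mode="r")
    U = np.linalg.solve(R.T, A.T).T  # U = A R^{-1}

    # Stage 2 (importance sampling): pick row i with probability proportional
    # to ||U_i||_2^2, which approximates its leverage score.
    row_sq = np.einsum("ij,ij->i", U, U)
    total = row_sq.sum()
    p = row_sq / total
    idx = rng.choice(n, size=n_iters, p=p)

    # Stage 3 (weighted SGD) on f(y) = sum_i (U_i y - b_i)^2.
    # 2 * (U_i y - b_i) * U_i / p_i is an unbiased estimate of grad f(y).
    # The constant default step is an assumption, not a tuned schedule.
    eta = step if step is not None else 1.0 / (2.0 * total)
    y = np.zeros(d)
    for i in idx:
        residual = U[i] @ y - b[i]
        y -= eta * 2.0 * residual * U[i] / p[i]

    # Map back to the original variables: x = R^{-1} y.
    return np.linalg.solve(R, y)


# Usage on a small, badly scaled, consistent system (synthetic data).
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 5)) * np.logspace(0, 3, 5)
x_true = rng.standard_normal(5)
b = A @ x_true
x_hat = pwsgd_l2_sketch(A, b)
print(np.linalg.norm(A @ x_hat - b) / np.linalg.norm(b))
```

With the default step and these sampling weights, each update is an exact projection onto the sampled equation of the preconditioned system, so the iteration converges quickly when $b$ lies in the range of $A$, as in the usage example. For noisy right-hand sides a constant step only reaches a neighborhood of the minimizer, so a smaller or decaying `step` would be needed.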

We also provide lower bounds on the coreset complexity for more general regression problems, indicating that new ideas will still be needed to extend similar RLA preconditioning ideas to weighted SGD algorithms for more general regression problems. Finally, the effectiveness of such algorithms is illustrated numerically on both synthetic and real datasets, and the results are consistent with our theoretical findings and demonstrate that PWSGD converges to a medium-precision solution, e.g., $\epsilon=10^{-3}$, more quickly.