Wednesday, March 05, 2014

Stochastic Gradient Methods 2014

Last week I attended Stochastic Gradient Methods workshop held at UCLA's IPAM . Surprisingly, there's still quite a bit of activity and unsolved questions around what is essentially, minimizing a quadratic function.

In 2009 Strohmer and Vershinin rediscovered an algorithm used for solving linear systems of equations from 1970 -- Kaczmarz method, and showed that this algorithm is a form of Stochastic Gradient. This view of SGD motivates a biased sampling strategy which gives faster convergence rate than regular Stochastic Gradient. This spurred a flurry of activity, motivating results in at least 5 different lectures.

In 2010, Nesterov showed that Randomized Coordinate Descent has a faster convergence rate than SGD, and in 2013 Singer showed a way to accelerate it to quadratic convergence. In 2013 Richtarik gave an alternative algorithm to get the same convergence rate, but also comes up with better step sizes that rely on sparsity pattern of the problem.

Summaries of talks I attended with links to slides are below:

Ben Recht

Gave an overview of Hogwild and Jellyfish methods. Hogwild has been covered a few times before at NIPS, but here's an overview slide

Jellyfish (described in their Large Scale Matrix completion paper) chooses sampling order in a way to minimize lock contention.

Also talked about their work on explaining the gap between performance of SGD sampling with replacement vs. without replacement. Empirically, without replacement works better (see Section 5 of "Beneath the valley" paper) yet until recently tools were missing to explain it. They are able to prove faster rates of no-replacement sampling for Kaczmarz algorithm by relying on "Noncommutative arithmetic-geometric mean inequality."


  • Slides Recht - we should all run hogwild!.pdf
  • Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences.
  • Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion. Recht and Re. 2011.
  • HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. Niu, Recht, Re, and Wright. 2011.

Yoram Singer

Talked about accelerating coordinate descent with momentum-like approach, dubbed Generalized Accelerated Gradient Descent. Nesterov's accelerated gradient method has quadratic convergence with linear dependence on condition number of the loss

Parallel coordinate descent depends on average of per-coordinate Lipschitz constants, which can be much better for badly conditioned loss:

The methods proposed has quadratic convergence of accelerated gradient, meanwhile retaining dependence on average curvature, rather than the worst


Dimitri Bertsekas

In-depth tutorial "Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Unified Framework"

One slide that stuck out is the one-dimensional illustration of why SGD works.

In the farout region, all gradients are pointing in the same direction, so taking gradient step with respect to a single component function works just as well as looking at the full sum.

This also serves as the motivation for "heavy ball" method (Polyak, 1964). When you are in farout region, you want to accelerate, while in confusion region, you want to decelerate, you can accomplish this by modifying gradient update formula as follows

$$x_{x+1} = x_k-\alpha_k \nabla f_{i_k}(x_k)+\beta_k(x_k-x_{k-1})$$

This is similar in spirit to "Accelerated Stochastic Approximation" of Kesten (1958) which grows the step size if the difference between successive $x$'s is the same sign, and shrinks if there are many sign changes.

Schmidt said Stochastic Averaged Gradient works better than Kesten's approach in a multi-dimensional setting.


Peter Richtarik

Gave overview of his "Accelerated, Parallel and Proximal Coordinate Descent". It gives a technical improvement over his previous work "Distributed coordinate descent method for learning with big data" ( which seems to have the meat of the contributions.

Here's a slide from his talk comparing various methods.

"Prox" column means the algorithm can take proximal steps, i.e., can be used with constraints and not-nice regularizers. "Accel" or Accelerated is whether the method is enjoys $O(1/k^2)$ convergence rate where $k$ is the iteration counter. "General f" means it applies for convex problems rather than quadratic. "Block" is whether method can update some of the coordinates at a time rather than all coordinates.

Setting of the problem is summarized in slide below

You are optimizing a sum of losses $f_e$, and not all losses depend on all examples. You want to update your sets of coordinates in blocks, in parallel. Sets of variables involved for each $f_e$ determine how well you can parallelize the problem. In half-a-dozen papers on his website he develops framework dubbed Expected Separable Overapproximation (ESO) to analyze such problems.

One outcome of ESO approach is a formula that incorporates sparsity of the problem into calculation of step size. See Table 3 of his approx paper (

$$v_i = \sum_{j=1}^m (1+\frac{(\omega_j-1)(\tau-1)}{\max(1,n-1)}A_{ji}^2$$

This is formula for step size for coordinate $i$ in randomized coordinate descent, computed as a sum over examples $j$. Quantity $\omega_j$, is the number of components of vector that example $x_j$ depends on, $n$ is dimensionality, and $\tau$ is the number of coordinates are updated in parallel. $A$ is the matrix of quadratic minimization problem, replaced with matrix of per-coordinate Lipschitz constant for general convex problems.


Rachel Ward and Deanna Needell

Gave background on their paper "Stochastic Gradient Descent and the Randomized Kaczmarz Algorithm". The setting:

Further in presentation, developed importance sampling for SGD. Traditionally, SGD picks random component of the sum above, and the number of steps required to reach given accuracy is proportional to the worst condition number (Lipshitz constant) over per-example losses.

Derived following formula for the number of steps needed to reach given accuracy $\epsilon$ with uniform sampling

$$k \propto \log \epsilon (\sup_i \frac{L_i}{\mu} + \epsilon^{-1} \frac{\sigma^2}{\mu^2})$$

For quadratics, the first term is close to largest condition number out of all component functions $f_i$, except you are normalizing by global smallest eigenvalue $\mu$, rather than per-component smallest eigenvalue $\mu_i$. The second term is "normalized consistency" - expected squared norm of the squared gradient divided by smallest eigenvalue squared.

Instead of uniform sampling, we can sample examples in linear proportion to Lipshitz constant for gradient of loss on that example. This cuts down the number of steps needed to average Lipschitz constant, normalized by strong convexity parameter $mu$, rather than largest Lipschitz constant. Since Lipschitz constant is the upper bound on the largest eigenvalue of the Hessian, this means number of steps grows in proportion to average condition number rather than the largest condition number.

The term involving Lipschitz constant now drops from max to average. In other words we get this:
$$\sup_i \frac{L_i}{\mu} \to \frac{\bar{L}}{\mu}$$

The second (consistency) term can instead potentially get larger, we get
$$\frac{\sigma^2}{\mu^2}\to \frac{\bar{L}\sigma^2}{\inf_i L_i \mu^2}$$

The best trade off depends on details of function -- badly conditioned, but accurate gradients -- sample proportionally to Lipshitz. Well conditioned and noisy gradients, do closer to uniform. She shows that if we sample halways between uniform and Lipshitz, so called "partially biased sampling", both terms are guaranteed to be smaller than for uniform sampling.

Yuriy Nesterov obtained similar bounds for sampling strategy and convergence in his "Efficiency of coordinate descent methods on huge-scale optimization problems". Key difference is that he samples which coordinate to update at each step, instead of sampling examples. Optimal sampling strategy comes down to picking coordinates in linear proportion to their Lipschitz constant, and the convergence rate also drops to the average of per-coordinate Lipschitz constants rather than the worst Lipschtitz constant. Roughly speaking, number of steps till convergence goes down to average eigenvalue of Hessian rather than worst eigenvalue.

Deanna Needell gave background on the Kaczmarz Algorithm which gives alternative way to motivate importance sampling results. In particular, first few slides illustrate why the order matters. She also gives analytic expression to find the best next point to sample for the quadratic case. This requires $O(\text{# of examples})$ search at each iteration. She then shows approximation approach based on dimensionality reduction that takes $O(1)$ time per step.

Ben Recht's made a similar point on impact of choosing better ordering in his presentation


Stephen Wright

Started with a nomenclature discussion on how "Stochastic Gradient Descent" methods don't qualify as gradient descent, because SGD steps can be in ascent directions for the global cost function. Instead, they should be referred to as "Stochastic Gradient" methods. Every speaker afterwards corrected themselves on the usage.

Gave overview of parallel Kaczmarz method and then extended analysis to get convergence rate for parallel Kaczmars with "Inconsistent Read" allowed -- situation where the parameter vector gets modified while it's being read.


Yann LeCun

Gave background on convolutional neural networks and showed demo of online learning using ImageNet. Basically it was network running network pre-trained on ImageNet, and using nearest neighbor in the embedding induced by activations of the last layer.

Impressively, it seems to do a good job learning to recognize from just a single example.

Also talked about connections between neural network learning and random matrix theory. You can see the connection if you rewrite activations of ReLU neural network as follows

$$\sum_P C(x) \prod_{(i,j) \in P} W_{i,j}$$

The sum is over all paths through active nodes from input layer to output node. Coefficients $C_x$ depend on input data. This is a polynomial with degree equal to the number of layers, and there are results from random matrix theory says that if coefficients $C(x)$ are Gaussian distributed, then local minima are close together in energy, so essentially, finding local minimum is as good as finding global minimum.


Francis Bach

Presented results on convergence rates of SGD and how they are affected by lack of strong convexity.


Jorge Nocedal

Talked about adaptation of quasi-Newton method to stochastic setting. Convergence of SGD depends on square of condition number, meanwhile Newton's method is independent of condition number, at the cost of step size that that costs $O(\text{dimensions}^2)$

The compromise he proposes is to do BFGS-like method where
1. You use exact Hessian information to compute product of Hessian and step direction
2. You only do it once every 20 iterations

This makes the cost of l-BFGS-like update similar to SGD update.


Asuman Ozdaglar

Introduced a way to extend ADMM to graph-structured problems without having to choose the order of updates. The setting of problem is summarized below

As you may recall, ADMM works by decoupling components of the loss by having each loss operate on their own copy of the parameters. You alternate between each function minimizing itself locally with their own copy of parameters, and setting values of parameters locally from the functions that have already been minimized.

This steps can be implemented as message passing on a factor graph -- factors here are components of the cost function, whereas nodes are variables that the cost function depends on. Each component function depends on a subset of variable, and each variable is involved ina  subset of component functions.

Implementation of ADMM is similar to Divide and Concur, where a readable overview is given in Yedidia's Message-Passing paper.

One inconvenience of this approach is that it requires establishing an arbitrary order of message updates.

Ozdaglar's idea is to introduce the symmetry by adding extra variables, one for each direction of the constraint variable, and adding an extra constraint that forces them to agree. The update is done in parallel, like parallel BP, followed by an extra step that synchronizes the extra constraint variables.


John Duchi

John Ducchi gave a white-board talk on convergence of 0-th order optimization. Happily, the convergence is only a factor of sqrt(dimensions) worse than standard SGD.

Started with succinct derivation of non-asymptotic error of proximal average algorithm, which looks a lot like averaged SGD, after $k$ steps, in terms of errors of gradients. The actual formula has no O-terms and proof is found in notes, but roughly it looks like this

$$E(\text{error}) <= O(\frac{1}{\sqrt{k}})+\frac{1}{k}\sum_{i=1}^{k} E[\|\epsilon_i\|]$$

Error here is in terms of value of the function, which is what we care about in applications (as opposed to distance from true parameter vector). As $k$ increases, the second term vanishes and you get the regular $1/\sqrt{k}$ convergence. If you don't care about constraints, "prox" step can be replaced by an SGD step.


Mark Schmidt

Gave an overview of their Stochastic Averaged Gradient algorithm. Full details and many extensions are in their hefty 45 page arxiv paper.

Their motivation is a method to combine fast initial convergence for stochastic method, and fast late-stage convergence of full-gradient methods, while keeping cheap iteration cost of stochastic gradient.

Stochastic Averaged Gradient reaches this goal with a simple modification of stochastic gradient. The idea is that at each gradient step, in addition to the gradient computed for the current data point, you also add up all the gradients computed on previous datapoints. Those gradients may be out of date, but for strongly convex loss with convex component functions, this staleness doesn't hurt.

Schmidt et al advocated sampling datapoints with high curvature more often based on the argument that such gradient might be changing faster, and needs to be evaluated more often. However, the formal justification of this intuition is not avaiable, and instead they fall back on the same analysis as Kaczmarz importance sampling described earlier.

One difference of weighted sampling from standard SGD setting is that examples can be sampled more often without needing to correct for this bias because the weight of each gradient is $1/n$ in SAG regardless of how many times the function is sampled. However, bias correction will come up as an issue in any large scale adaptation of SAG when you can't store all gradients in memory.


Lin Xiao

Gave an overview of stochastic variance reduction gradient methods. The idea of variance reduction is to periodically evaluate full gradient, and then use it to adjust future gradient steps. If we evaluated full gradient at previous point $\tilde{x}$, formula for gradient update becomes as follows

$$x_{k+1}=x_k - \nu (\nabla f_{i_k} - \nabla f_{i_k}(\tilde{x})+\nabla F(\tilde{x}))$$

Here $\nabla{F(\tilde{x})}$ is the full gradient evaluated at some previous point $\tilde{x}$, $\nabla{f_{i_k}(x_k)}$ is the gradient evaluated at loss for current example $x_k$.

The idea of variance reduction is illustrated below.

On the left you see what would happen if you applied gradient reduction formula with $\nabla{F(\tilde{x})}$ computed at each step. That reduces to regular gradient descent. If instead we evaluate full gradient once every $k$ iterations, the correction will be based on stale value of gradient and not quite correct, however the mean error is zero so it gives an unbiased estimate of the correction term.

Then they introduce a weighted sampling strategy, where datapoints are sampled proportionally to the condition number of individual loss functions. When number of iterations is much larger than number of examples, weighted sampling strategy drops convergence to $O(C_{\text{avg}}$ steps as opposed to $O(C_\max)$ steps for uniform sampling, C_{\text{avg}} is the average condition number over all component loss functions.


James Spall

Gave results on Stochastic Approximation methods. Aproximation can be seen as minimization of distance between solution and ideal solution, so SA methods come down to some form of stochastic optimization. The difference is that the setting is more general - non-convexity, can't compute gradients, possibly discrete problem.

Standard approach to derivative free methods is Finite Difference Stochastic Approximation (FDSA) where to numerically compute gradient, you do $2p$ function evaluations where $p$ is dimensionality.

The idea of Simultaneous Perturbation Stochastic Approximation method (SPSA) is to evaluate gradient along randomly chosen directions, and take step in that directions with step-length proportional to gradient in that direction. This requires two function evaluations instead of $2 p$ for FDSA, and works just as well.

Two summary slides from the SPSA talk:

Here was a graph of numerical simulation of SPSA vs standard approach

He gave a more in-depth overview of the methods in 2012 NIPS talk. It's available as youtube video, but here are screenshots of some intro slides.

Simple SPSA is essentially a first order method, and has the same problems as other first order methods:

  • sensitivity to upscaling of units of $\theta$
  • slow convergence in the final phase

To address these, he introduces Adaptive Stochastic Approximation by the Simultaneous Perturbation Method (ASP) which goes further by numerically estimating the Hessian in addition to the gradient.

The approach to approximating Hessian is similar in spirit to SPSA -- compute gradient in two random directions, and estimate the Hessian numerically from that (formula 2.2 in "Adaptive Stochastic Approximation by the Simultaneous Perturbation Method"). This requires 4 function evaluations.

This estimate is noisy, so use momentum to smooth Hessian estimates.

More recent work (Spall 2009) gives an improved formula for estimating the Hessian numerically using what he calls "feedback term".

Adaptive SPSA methods store Hessian approximation explicitly, like BFGS, so they aren't directly applicable to deep neural nets.