Saturday, February 12, 2005

Tensor Methods for Optimization

Second-order methods use an exact Hessian, or a compact approximation to it, to direct the search for the minimum. They are called second-order because the Hessian, i.e., the matrix of second derivatives, supplies the second-order term in the Taylor expansion of the function. But what if our function is not well approximated by a quadratic, and we want to guide the minimization using third- or fourth-order information? That's where tensor methods come in.
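As a toy illustration (a hypothetical one-dimensional sketch, not Schnabel's algorithm), here is plain Newton iteration, which at each step minimizes the local quadratic (second-order Taylor) model; tensor methods extend exactly this local model with higher-order terms:

```python
# Toy 1-D Newton's method: at each step, minimize the local quadratic
# (second-order Taylor) model m(x+d) = f(x) + f'(x) d + 0.5 f''(x) d^2,
# whose minimizer is d = -f'(x) / f''(x).

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < tol:
            break
        x -= g / hess(x)  # Newton step: exact for a quadratic f
    return x

# Example: f(x) = x^4 - 3x^2 + x is not quadratic, so the quadratic
# model is only locally accurate; a higher-order (tensor) model would
# capture more of the function's shape per step.
f_grad = lambda x: 4 * x**3 - 6 * x + 1
f_hess = lambda x: 12 * x**2 - 6

x_min = newton_minimize(f_grad, f_hess, x0=1.0)
```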


Description

The main idea of Schnabel's tensor method is to select a set of p points from previous evaluations that are strongly linearly independent (each new direction makes an angle of at least 45 degrees with the old directions). The method then interpolates the function values at those points by fitting a fourth-order model with a specially restricted form for the third- and fourth-order terms. Because of this restricted form, the overhead of this step is not significant compared to the standard Gauss-Newton algorithm. Because it approximates the function with a fourth-order expansion, the method converges on a larger set of problems. Schnabel also proves superlinear convergence of this method for some cases where Newton's method only converges linearly.
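For concreteness, the model being fit has roughly this form (my paraphrase of Schnabel and Chow's notation):

    m(x_c + d) = f(x_c) + g_c^T d + 1/2 d^T H_c d + 1/6 T_c d^3 + 1/24 V_c d^4

where g_c and H_c are the gradient and Hessian at the current iterate x_c, and T_c and V_c are the third- and fourth-order tensor terms. These are restricted to sums of low-rank tensors chosen so that the interpolation conditions at the p selected points hold, and that low rank is what keeps the extra linear algebra cheap.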

Performance

Seth Greenblatt tried the tensor method on some econometric models and saw a 1.5-3x reduction in the number of function evaluations. He compared it against the Newton-Raphson method using exact matrix inversion (O(n^3)), and the extra overhead of the tensor method was not significant. The problems he tested on were econometric models, including the Lucas-Rapping model and the Berndt-Wood KLEM model. The Lucas-Rapping model in his thesis was essentially logistic regression with econometric semantics.

Schnabel's paper in SIAM J. Optimization also used an explicit Hessian and O(n^3) steps to invert it. The later TENSOLVE paper (with Bouaricha, 1997) introduces an implementation of the tensor method which uses the same approach as Gauss-Newton to approximate the second-order term from gradients, giving O(n^2) space and time requirements. The asymptotic overhead of using tensors is O(n^1.5).

Thoughts

TenSolve stores the approximation to the Hessian explicitly, which takes O(n^2) space. It seems conceptually possible to reduce storage to O(n) by using a method analogous to limited-memory BFGS to represent the Hessian. If that worked, it would yield a tensor optimization method with O(n) storage complexity and O(n^1.5) computation complexity.
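To illustrate the limited-memory idea I have in mind (this is the standard L-BFGS two-loop recursion, not anything from TenSolve): store only the last m pairs (s_i, y_i) of step and gradient differences, O(mn) space, and apply the inverse-Hessian approximation implicitly:

```python
# Sketch of the standard L-BFGS two-loop recursion: applies an
# approximate inverse Hessian to a gradient using only the stored
# (s, y) curvature pairs, so storage is O(m n) rather than the
# O(n^2) of an explicit Hessian. Hypothetical illustration only.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lbfgs_direction(grad, s_list, y_list):
    """Return -H * grad, where H is the L-BFGS inverse-Hessian
    approximation built from pairs s_i = x_{i+1} - x_i and
    y_i = g_{i+1} - g_i (most recent pair last in the lists)."""
    q = list(grad)
    rhos = [1.0 / dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: newest pair to oldest
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        alpha = rho * dot(s, q)
        alphas.append(alpha)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
    # Scale by the usual initial inverse-Hessian estimate gamma * I
    if s_list:
        gamma = dot(s_list[-1], y_list[-1]) / dot(y_list[-1], y_list[-1])
    else:
        gamma = 1.0
    r = [gamma * qi for qi in q]
    # Second loop: oldest pair to newest
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos),
                                  reversed(alphas)):
        beta = rho * dot(y, r)
        r = [ri + (alpha - beta) * si for ri, si in zip(r, s)]
    return [-ri for ri in r]
```

With one exact curvature pair from a quadratic with Hessian 2I, this recovers the exact Newton direction, which is the sanity check that makes the O(n)-storage idea plausible; whether the rank-one tensor terms of TenSolve admit a similar compact update is the open question.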

  • What are some applications of Machine Learning where an O(n^1.5) method would be acceptable?
  • What would be a good dataset on which to test CRF training using this method?


Resources:

  • Schnabel, Robert B. and Ta-Tung Chow (1991), "Tensor Methods for Unconstrained Optimization Using Second Derivatives," SIAM J. Optimization 1(3), 293-315
  • Bouaricha, Ali and Robert B. Schnabel (1997), "Algorithm 768: TENSOLVE: A Software Package for Solving Systems of Nonlinear Equations and Nonlinear Least-Squares Problems Using Tensor Methods," ACM Trans. Math. Softw. 23(2), 174-195
  • Greenblatt, "Tensor Methods for Full-Information Maximum Likelihood Estimation", PhD thesis, George Washington University
  • TenSolve (Implementation of Schnabel's tensor optimization method): http://www.netlib.org/toms/768

3 comments:

Amir massoud Farahmand said...

Thanks! This post was interesting.

Yaroslav said...

Thanks! It's good to know someone is actually reading my posts

Amir massoud Farahmand said...

Haha! Yes! Be sure that someone reads your weblog. At least, I do!
I think it would be worthwhile to analyze the interaction between a weblog and its readers. It is not that easy, though.
However, from my experience with non-technical weblog writing in Persian, I have found that, depending on the number of visitors, the ratio of readers who comment to the total number of readers is somewhere between 2% and 10%. This ratio, in my opinion, is monotonically increasing (!), as a more visited weblog has a higher percentage of commenters. I guess this percentage would be lower for academic weblogs.