Monday, February 28, 2005

Blogs of Note

From Geomblog -- "Machine learning blogs appear to be popping up faster than you can say 'expectation-maximization'". On that note, here are a few machine learning (and related) blogs I came across recently.



Saturday, February 26, 2005

Knowledge Intensive Learning

When prior knowledge is given in the form of a prior distribution, Bayesian methods are the rational way of making decisions, based on Cox's arguments. However, prior knowledge doesn't usually come in the form of a prior. For instance, an expert may say that X and Y are independent, or that increasing X will tend to increase Y. We need to figure out how to use these kinds of knowledge to bias our learners, and this is where Knowledge Intensive Learning comes in. Eric Altendorf does research in this area and is writing a series of posts on it in his new blog.

Friday, February 25, 2005

Naive Bayes to fight HIV?

Quoting -- "Microsoft scientists David Heckerman and Nebojsa Jojic were the first to pioneer the medical uses of anti-spam software. They have joined up with doctors and scientists from the University of Washington and Australia's Royal Perth Hospital to build more potent vaccines based on the data gathered by Microsoft's technologies."


Thursday, February 24, 2005

How to browse the web while pretending to do Machine Learning research

  • What are some other ways of finding interesting stuff?

Tuesday, February 22, 2005


I came across CiteULike on Seb's Open Research blog a week ago, tried it out, and it seems very useful. Basically it's a cross between bibliography management software like JabRef and social bookmarking. You can use it to keep track of the papers you find and export them in BibTeX format. Then you can see who else added a given paper, and what other papers they are reading. You can also add comments to papers and view other people's comments (not many at this point). Suresh has a more detailed review.


Thursday, February 17, 2005

Machine Learning reading groups

If you are looking for ideas for things to read, you should check out what other Machine Learning Reading Groups are reading. I compiled the following list by searching Google and asking people.

This list is not very up to date. This is a permalink, so feel free to link to it. Please let me know through email or the comments section about other up-to-date MLRGs you know of (the site must list papers).

Current MLRGs
(as of Feb 19th 2005)
  • Oregon State U (Roweis, Tishby, Jordan, Collins, Caticha) -- mlrg
  • Gatsby (MacKay, ARD, LDA) -- mlrg
  • Humboldt-Universität zu Berlin (ICML, KDD papers) -- mlrg
  • TTIC Chicago list of readings (Langford, Kakade, McAllester) -- seminars
  • University of New Mexico (Jaakkola, Mooney, Scholkopf) -- mlrg
  • Oregon Health and Science U. (Seeger, Geman, MacKay) -- mlrg
  • Washington University (Weiss, Srebro) -- mlrg
  • MIT (Neal, Hofmann, Tishby) -- mlrg
  • University of California, Irvine (Blei, Jordan, Welling) -- mlrg
  • University of Amsterdam (Manning, Koller, Collins, Valiant) -- mlrg
  • Hebrew University (IB, SVM, Rademacher, Chechik) -- mlrg
  • Cornell (Jordan, Schuurmans, Cristianini) -- mlrg
  • South Carolina (Kakade, Dresner, Feigenbaum) -- mlrg
  • U. Alberta (Sutton, RL, MaxEnt) -- mlrg
  • Weizmann Institute (vision oriented, Jordan, Buntine) -- mlrg
  • Helsinki U of Tech (mostly neuroscience, Sejnowski) -- mlrg
  • University of London (vision oriented) -- mlrg
  • Rutgers (Michael Jordan's notes) -- mlrg
  • UMass (vision oriented, ICA, kernel PCA) -- mlrg
  • Edinburgh (NLP group, Hofmann, Manning, Collins) -- mlrg
  • U. of California, San Diego (Tishby, Lafferty, Joachims, Herbrich) -- mlrg
  • UTexas (Aggarwal, Altun, Manning) -- mlrg
Seminars, Journal Clubs, slightly outdated MLRGs
  • Columbia (November) -- mlrg
  • U. Toronto (July) -- mlrg
  • Tufts University (December) -- mlrg
  • Australian National U. (December) -- mlrg
  • Centre Universitaire d'Informatique (June) -- journal club
  • Microsoft Research (Herbrich, large-margin, factor graphs, kernels) -- mlrg
  • CWI Journal Club (Netherlands) -- journal club
  • Seminars at NYU (Langford, Saul) -- seminars
  • Graphical Models Reading Group at CMU (about 10 months old) -- mlrg

Who is being read?
Below is a list of the most popular authors, ranked by how many groups in the "Current MLRGs" section listed them:

9 - Koller
8 - Jordan
7 - Weiss
6 - Taskar, Saul, McCallum, Lafferty, Ghahramani, Friedman, Collins
5 - Roweis, Pereira, Ng, Domingos
4 - Zhu, Tishby, Russell, Roth, Li, Kearns, Jaakkola, Guestrin, Blei
3 - Zhou, Yair, Wong, Wellner, Tenenbaum, Slonim, Singh, Singer, Scholkopf, Schapire, Mooney, Mansour, Manning, MacKay, Joachims, Hofmann, Griffiths, Freund, Freeman, Dengyong, Breiman, Bartlett, Altun

  • What other reading groups have websites?
  • What are other ways of finding interesting papers?

Saturday, February 12, 2005

Tensor Methods for Optimization

Second-order methods use an exact Hessian, or a compact approximation to it, to direct the search for the minimum. They are called second-order because the Hessian, i.e. the matrix of second derivatives, is the second-order term in the Taylor expansion of the function. What if our function is not quadratic, and we want to guide the minimization using 3rd or 4th order information? That's where tensor methods come in.


The main idea of Schnabel's tensor method is to select a set of p points from previous evaluations such that they are mostly linearly independent (each new direction makes at least a 45 degree angle with the old directions). The idea is then to interpolate the function values at those points by fitting a 4th order model with a special form for the 3rd and 4th order terms. Because of the restricted form of these terms, the overhead of this step is not significant compared to the standard Gauss-Newton algorithm. And because it approximates the function with a 4th order expansion, the method converges on a larger set of problems. Schnabel also proves superlinear convergence of the method in some cases where Newton's method converges only linearly.


Seth Greenblatt tried the tensor method on some econometric models and saw a 1.5-3x reduction in the number of function evaluations. He compared it against the Newton-Raphson method using exact matrix inversion (O(n^3)), and the extra overhead of the tensor method wasn't significant. The problems he tested on were econometric models, including the Lucas-Rapping model and the Berndt-Wood KLEM model. The Lucas-Rapping model in his thesis was basically logistic regression with some econometric semantics.

Schnabel's paper in SIAM J. Optimization also used an explicit Hessian and O(n^3) steps to invert it. His 1997 paper introduces an implementation of the tensor method, TenSolve, which uses the same approach as Gauss-Newton to approximate the second-order term from gradients, giving an O(n^2) space and time requirement. The asymptotic overhead of using tensors is O(n^1.5).


TenSolve stores the approximation to the Hessian explicitly, so that takes O(n^2) space. It seems conceptually possible to reduce the storage to O(n) by using a method analogous to limited-memory BFGS to store the Hessian. If that worked, it would result in a tensor optimization method with O(n) storage complexity and O(n^1.5) computational complexity.
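Schnabel's algorithm itself is fairly involved, but the underlying idea of going beyond second-order information is easy to illustrate in one dimension. The sketch below is not Schnabel's method; it's just Halley's method applied to f'(x) = 0, which uses the third derivative, compared against plain Newton steps on f(x) = exp(x) - 2x, whose minimum is at x = ln 2:

```python
import math

def newton_min(df, d2f, x, tol=1e-10, max_iter=100):
    """Newton's method on f'(x) = 0 (minimizes a quadratic model of f)."""
    for i in range(max_iter):
        g = df(x)
        if abs(g) < tol:
            return x, i
        x -= g / d2f(x)
    return x, max_iter

def halley_min(df, d2f, d3f, x, tol=1e-10, max_iter=100):
    """Halley's method on f'(x) = 0: uses third-derivative information,
    analogous in spirit to the higher-order terms in tensor methods."""
    for i in range(max_iter):
        g = df(x)
        if abs(g) < tol:
            return x, i
        gp, gpp = d2f(x), d3f(x)
        x -= 2 * g * gp / (2 * gp * gp - g * gpp)
    return x, max_iter

# minimize f(x) = exp(x) - 2x; the unique minimum is at x = ln 2
df  = lambda x: math.exp(x) - 2
d2f = lambda x: math.exp(x)
d3f = lambda x: math.exp(x)

x_n, it_n = newton_min(df, d2f, 3.0)
x_h, it_h = halley_min(df, d2f, d3f, 3.0)
print(x_n, it_n, x_h, it_h)  # both converge to ln 2 ~ 0.6931; Halley in fewer steps
```

The higher-order model buys faster local convergence at the cost of extra derivative information per step, which is exactly the trade-off the tensor papers quantify.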

  • What are some applications of Machine Learning where an O(n^1.5) method would be acceptable?
  • What would be a good dataset on which to test CRF training using this method?


  • Schnabel, Robert B. and Ta-Tung Chow (1991), "Tensor Methods for Unconstrained Optimization Using Second Derivatives," SIAM J. Optimization 1(3), 293-315
  • Bouaricha, Ali and Robert B. Schnabel (1997), "Algorithm 768: TENSOLVE: A Software Package for Solving Systems of Nonlinear Equations and Nonlinear Least-Squares Problems Using Tensor Methods," ACM Trans. Math. Softw. 23(2), 174-195
  • Greenblatt, Seth, "Tensor Methods for Full-Information Maximum Likelihood Estimation," PhD thesis, George Washington University
  • TenSolve (implementation of Schnabel's tensor optimization method)

Friday, February 11, 2005

Uninformative Priors

I came across an interesting article by Zhu and Lu on how non-informative priors can often be counter-intuitive. For instance, suppose we are estimating the parameter of a Bernoulli distribution. If we don't know anything, the most intuitive prior to use is the uniform one. However, Jeffreys' "uninformative" prior doesn't look uniform at all: it is Beta(1/2, 1/2), with density proportional to theta^(-1/2) (1-theta)^(-1/2).

This prior seems to favour values of theta close to 0 or 1, so how can it be uninformative?

The intuition they provide is the following: think of Maximum Likelihood estimator as the most uninformative estimate of parameters. Then we can rank informativeness of priors by how closely the Bayesian Posterior Mean estimate comes to the maximum likelihood estimate.

If we observe k heads out of n tosses, we get the following estimates:

1. MLE: k/n
2. PM using Jeffreys' prior: (k+1/2)/(n+1)
3. Laplace smoothing (uniform prior): (k+1)/(n+2)
4. PM using a Beta(a,b) prior: (k+a)/(n+a+b)

You can see that by this informal criterion for informativeness, Jeffreys' prior is more uninformative than the uniform, and Beta(0,0) (the limiting distribution) is the least informative of all, because then the Posterior Mean estimate coincides with the MLE.
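A quick sketch of these estimates in code, using exact fractions to keep the comparison clean:

```python
from fractions import Fraction

def estimates(k, n):
    """Estimates of a Bernoulli parameter after k heads in n tosses."""
    mle      = Fraction(k, n)                    # maximum likelihood
    jeffreys = Fraction(2 * k + 1, 2 * (n + 1))  # PM, Jeffreys' prior: (k+1/2)/(n+1)
    laplace  = Fraction(k + 1, n + 2)            # PM, uniform prior: (k+1)/(n+2)
    return mle, jeffreys, laplace

mle, jeff, lap = estimates(k=3, n=10)
print(mle, jeff, lap)  # 3/10 7/22 1/3

# Jeffreys' posterior mean lands closer to the MLE than the uniform-prior
# one, matching the informal "less informative" ranking above.
assert abs(jeff - mle) < abs(lap - mle)
```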


Wednesday, February 09, 2005

Why I started this blog

I've been doing research in Machine Learning (Conditional Random Fields and related) for the last year. This is a place for me to organize my thoughts and write down interesting new things I come across (but feel free to add comments). Some of the things I may be adding:

  1. Comments on papers
  2. Mini-tutorials on ML-related topics
  3. Links to online Machine Learning resources

It won't be linear in nature, as I'll revise old posts when relevant information comes up. Maybe in the future I'll switch to a setup similar to Cosma Shalizi's notebooks, but at this point a blog is the most practical thing to do.


Monday, February 07, 2005

Lagrange multipliers: equality constraints

Constrained optimization often comes up in machine learning.

  1. Finding lengths of best instantaneous code when distribution of source is known
  2. Finding maximum entropy distribution subject to given constraints
  3. Minimizing distortion specified by the amount of information X provides about Y (Tishby's Information Bottleneck method)

All of the examples above can be solved using the method of Lagrange multipliers. To understand the motivation for Lagrange multipliers with equality constraints, you should check out Dan Klein's excellent tutorial.

However, you don't have to understand them to be able to go through the math. Here's how you can do it in 5 easy steps:

  1. Write down the thing you are optimizing as J(x)
  2. Write down your constraints as g_i(x)=0
  3. Define L(x,l1,l2,...) = J(x) + l1 g1(x) + l2 g2(x) + ...
  4. Differentiate L with respect to each variable and set the derivatives to 0
  5. Solve the resulting system of equations

Small Example

Suppose you want to maximize

J(x, y) = -(x^2 + y^2)

subject to

x - y + 1 = 0

(the black line is where your solution must lie)

Then our only constraint function is g(x,y) = x - y + 1 = 0, and our L (the Lagrangian) becomes

L(x, y, lambda) = -(x^2 + y^2) + lambda (x - y + 1)

Differentiating and setting to 0 we get

dL/dx = -2x + lambda = 0
dL/dy = -2y - lambda = 0
dL/dlambda = x - y + 1 = 0

Solving this system of 3 equations by substitution gives x = -0.5, y = 0.5, lambda = -1. The Lagrange multiplier lambda_i represents the ratio of the gradients of the objective function J and the i'th constraint function g_i at the solution point (which makes sense because they point in the same direction).
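Here's a quick sketch checking the small example numerically, taking the objective to be J(x, y) = -(x^2 + y^2), which is consistent with the stated solution x = -0.5, y = 0.5, lambda = -1:

```python
# Maximize J(x, y) = -(x**2 + y**2) subject to g(x, y) = x - y + 1 = 0,
# following the 5 steps in the post. Stationarity of L = J + lam*g gives
# a small linear system we can solve by substitution.

def solve_small_example():
    # dL/dx   = -2x + lam = 0    =>  x =  lam / 2
    # dL/dy   = -2y - lam = 0    =>  y = -lam / 2
    # dL/dlam =  x - y + 1 = 0   =>  lam/2 + lam/2 + 1 = 0  =>  lam = -1
    lam = -1.0
    x, y = lam / 2, -lam / 2
    return x, y, lam

x, y, lam = solve_small_example()
print(x, y, lam)  # -0.5 0.5 -1.0

# Sanity check: at the solution the gradients of J and g are parallel,
# matching lambda's role as the ratio of the two gradients.
assert (-2 * x, -2 * y) == (1, -1) and x - y + 1 == 0
```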

Bigger Example

For a more realistic example, consider maximizing entropy over the set of discrete distributions on the positive integers with a fixed first moment. Since we have an infinite number of possible X's, differentiating the Lagrangian gives us a countably infinite system of equations. But that's not actually a problem (unless you try to write them all down). Here's the detailed derivation.
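In outline, the calculation goes as follows, with lambda_1 and lambda_2 as the multipliers for the normalization and mean constraints (x ranges over the positive integers):

```latex
L(p, \lambda_1, \lambda_2) = -\sum_{x=1}^{\infty} p_x \log p_x
  + \lambda_1 \Big( \sum_{x=1}^{\infty} p_x - 1 \Big)
  + \lambda_2 \Big( \sum_{x=1}^{\infty} x \, p_x - \mu \Big)

\frac{\partial L}{\partial p_x} = -\log p_x - 1 + \lambda_1 + \lambda_2 x = 0
\quad \Longrightarrow \quad
p_x = e^{\lambda_1 - 1} \left( e^{\lambda_2} \right)^{x}
```

So p_x is proportional to q^x with q = e^{lambda_2} in (0,1), i.e. a geometric distribution, with q pinned down by the mean constraint.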

So you can see that the highest-entropy discrete distribution with fixed mean is the geometric distribution. We can solve the same problem for continuous X's on the positive half-line and get the exponential distribution. If we allow X to span the whole real line, there is no solution, for the same reason that a "uniform distribution over all reals" doesn't exist.


  1. What should I use for complicated 3D graphs? I tried Mathematica, but I'm not sure how to combine two 3D graphs into one
  2. Can anyone do Exercise 1 in Chapter 5 of Cristianini's SVM book?
  3. If we use Robinson's non-standard analysis, could we sensibly define a "uniform distribution over all integers"?


Saturday, February 05, 2005

Interesting blogs, newsgroups, chatrooms

Here are some things that someone who likes math, Machine Learning, or statistics may find interesting:




Mailing Lists:


  • #MachineLearning

Finding Blogs:

After looking at the "Referring Link" stats of people browsing this blog, I discovered two new ways of finding relevant blogs.

I found a few people who came to this blog after searching for mentions of , and their IPs traced to Illinois, I wonder why ;)

Feel free to post other cool blogs, or other methods of finding them.

Thursday, February 03, 2005

Getting probabilities from students

A comment on the Computational Complexity blog inspired me to look at the following problem.

Suppose you give a true/false test to students and want them to put down their subjective probability that each answer is true. To keep them honest, you want to award points in such a way that, in order to maximize their expected points, a student needs to report their true probabilities. Which utility function should we use? It turns out that f(x) = A log x + B is one such family. Here's the derivation.
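A quick numerical check of this property (a sketch; any A > 0 and B give the same maximizer):

```python
import math

def expected_score(p_true, p_report, A=1.0, B=0.0):
    """Expected points when the answer is true with probability p_true
    and the student reports p_report, scored with f(x) = A*log(x) + B."""
    f = lambda x: A * math.log(x) + B
    return p_true * f(p_report) + (1 - p_true) * f(1 - p_report)

p = 0.7
grid = [i / 100 for i in range(1, 100)]  # candidate reported probabilities
best = max(grid, key=lambda r: expected_score(p, r))
print(best)  # 0.7 -- honest reporting maximizes the expected score
```

Analytically: differentiating p*f(x) + (1-p)*f(1-x) with f = log and setting to zero gives p/x = (1-p)/(1-x), i.e. x = p.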

1) Do any other such utility functions exist?
2) What are other fun ways to torture students?

Here are graphs of three possible functions that can be used for eliciting probabilities from students. They are all concave; maybe one can prove that any suitable function must be concave?