Machine Learning, etc: August 2010

After following some discussions on overflow sites, I re-read Shannon/Cover's coverage of method of types, and want to summarize it here because of how useful and underappreciated it is.

A type or type class is basically an empirical distribution for a sequence of n independent events. For instance, suppose we flip coin n=2 times. We have 3 type classes, all heads, all tails, one head/one tail. Size of the type class is the number of sequences in that class. Here, two of our classes have size 1 and "one head/one tail" has size 2.

Since the number of types grows polynomially with n, while number of observation sequences grows exponentially, some type classes end up getting an exponential number of sequences. Since constants of exponential growth are different for different types, probability mass ends up getting concentrated in a small number of types and several important properties follow from this concentration behavior.

In particular, Thomas/Cover use method of types to formalize and prove the following statements in Chapter 12 of "Information Theory and Statistics" (1991).

Probability of hypothesis E under distribution P is determined by the probability of closest member of E to P (Sanov's theorem)
To test whether data x was generated by p or q, it's sufficient to only look at the ratio of probabilities of x under p and under q (Neyman-Pearson lemma)
When data is generated by p or q and you are trying to tell which one it is from data, probability of making a mistake for an optimal classifier is bounded by exp(-n max(KL(p,q),KL(q,p)). A tighter bound is to use KL(x,p)=KL(x,q) where x is the "halfway point" between distributions (Chernoff's theorem)

The last point actually gives an alternative (more intuitive IMHO) motivation for Fisher's information metric. It's known that when w1 and w2 are close, KL-divergence between parametrized distributions p(w1) and p(w2) behaves as the square of Mahalonobis distance between w1 and w2 with Fisher Information Matrix as the covariance matrix. So closeness in Fisher Information Metric now corresponds to difficulty of telling two distributions apart based on data. This notion can be used to define the volume of a probability distribution family by counting "balls" of indistinguishable probability densities. It seems like a very "aesthetically" appealing approach to model selection, although I haven't seen it used much in practice, perhaps because so few people in machine learning know differential geometry.

Rodriguez says it outperforms BIC and AIC by orders of magnitude

Resources

Thomas, Cover "Information Theory and Statistics"
Csiszar, Shields Notes on Information Theory and Statistics

Machine Learning, etc

Monday, August 30, 2010

New Machine Learning blog

Sunday, August 15, 2010

Method of Types

Tuesday, August 10, 2010

Interesting Stack Exchanges