Sunday, August 15, 2010

Method of Types

After following some discussions on overflow sites, I re-read Shannon/Cover's coverage of method of types, and want to summarize it here because of how useful and underappreciated it is.

A type or type class is basically an empirical distribution for a sequence of n independent events. For instance, suppose we flip coin n=2 times. We have 3 type classes, all heads, all tails, one head/one tail. Size of the type class is the number of sequences in that class. Here, two of our classes have size 1 and "one head/one tail" has size 2.

Since the number of types grows polynomially with n, while number of observation sequences grows exponentially, some type classes end up getting an exponential number of sequences. Since constants of exponential growth are different for different types, probability mass ends up getting concentrated in a small number of types and several important properties follow from this concentration behavior.

In particular, Thomas/Cover use method of types to formalize and prove the following statements in Chapter 12 of "Information Theory and Statistics" (1991).
  • Probability of hypothesis E under distribution P is determined by the probability of closest member of E to P (Sanov's theorem)
  • To test whether data x was generated by p or q, it's sufficient to only look at the ratio of probabilities of x under p and under q (Neyman-Pearson lemma)
  • When data is generated by p or q and you are trying to tell which one it is from data, probability of making a mistake for an optimal classifier is bounded by exp(-n max(KL(p,q),KL(q,p)). A tighter bound is to use KL(x,p)=KL(x,q) where x is the "halfway point" between distributions (Chernoff's theorem)
The last point actually gives an alternative (more intuitive IMHO) motivation for Fisher's information metric. It's known that when w1 and w2 are close, KL-divergence between parametrized distributions p(w1) and p(w2) behaves as the square of Mahalonobis distance between w1 and w2 with Fisher Information Matrix as the covariance matrix. So closeness in Fisher Information Metric now corresponds to difficulty of telling two distributions apart based on data. This notion can be used to define the volume of a probability distribution family by counting "balls" of indistinguishable probability densities. It seems like a very "aesthetically" appealing approach to model selection, although I haven't seen it used much in practice, perhaps because so few people in machine learning know differential geometry.

Rodriguez says it outperforms BIC and AIC by orders of magnitude



draj said...

Excellent machine learning blog,thanks for sharing...
Seo Internship in Bangalore
Smo Internship in Bangalore
Digital Marketing Internship Program in Bangalore

salwar kameez usa said...

Faisalabad is one of the biggest cities in Pakistan and the hub of the textile industry. It is widely acknowledged as the Manchester of Pakistan due to its large industrial role. The quality of the fabrics produced in this city has no parallel. In fact, the fabric is something of a specialty of Faisalabad. Many people from all over the country flock to this city for a spot of cloth shopping. We aim to provide you all of the best of Faisalabad at our store.

lakshmibhucynix said...

I really liked your blog article.Really thank you! Really Cool.
Machine Learning Online Training In Hyderabad