Thursday, October 28, 2010

Sometimes simplest learners are best -- experiment

In 1993, R.Holte noted that "Very simple classification rules perform well on most commonly used datasets". His very simple rules are basically binary classifiers that make decision based on value of a single attribute.

In the race for latest and greatest classifiers it's easy to forget that many kinds of classification tasks in 1993 still come up today, so conclusions of Holte's paper still apply.

I came across this effect couple of years ago on a set of document categorization experiments. Folks at WinnowTag wanted me to look into various methods for automatically tagging news/blog items. The idea was that a user manually tags some articles, the system learns and gradually takes over the role of tagging. You can approach it as a straight-forward binary classification task since number of errors represents the number of tag additions/deletions needed from user to fix the tagging.

At first I thought about jumping in and coding some latest and greatest approach for multi-tag labeling like one in from NIPS'04 paper, but as a sanity check, I decided to try existing Weka's classifiers on the dataset.

Nice thing about Weka is that it provides a unified command line interface to all of its classifiers, so you could make a simple Python script like this to run 64 different classifiers on your classification task with default parameters provided by Weka.

4000 hours of EC2 runs later, I got the results. Simplest classifiers like Naive Bayes, OneR (from Holt's paper) made the smallest number of errors. Weka's BayesNet, J48, VotedPerceptron, LibSVM performed as bad or worse as predicting "False" for every tag. Logistic Regression, Neural Networks, AdaboostM1 failed to finish after a day of training.

In retrospect, it makes sense -- in the original dataset there were 35 tags and 5-40 documents per tag, sometimes as small as a couple of sentences, and with the amount of inherent noise in tagging, that's simply not enough data to identify some of the deeper patterns that higher complexity classifiers are made for. In this settings, I wouldn't expect "advanced" classifiers to give a significant improvement over Naive Bayes even after significant tuning of hyper-parameters

It's been suggested that more data beats better algorithms, but here, I would say that when you have little data/prior knowledge, simplest learners work as well as the most advanced ones.

PS: $1 for the caption to the strip above

Tuesday, October 26, 2010

Times they are a'changin'

Suresh points out that at this year's FOCS, not a single person wanted printed proceedings, whereas few years ago, a third of the audience would ask for printed version.

I see a similar shift happening with technical books. Purists say that you can't beat the convenience of browsing a real book, but I say that you can't beat the convenience of having access to all your books wherever you go. In the last 6 years, I came across about a hundred technical books I considered interesting enough to get an electronic version of, and since publishers didn't provide electronic versions, this often meant me going to the library and scanning the book myself. One of my first blog posts was on the subject of book scanning.

Today, almost 6 years later, I almost never need to use the scanner, because almost every book I need is already available in electronic form, scanned by someone and circulating freely on the Internet. Increasing number of those illegally circulating books are not even scans, but look like bona fide original electronic versions. Number of legally available electronic versions is also increasing, recent examples I came across are Boyd's Convex Optimization, McCallaugh's Tensor Methods in Statistics and Szeliski's Computer Vision: Algorithms and Applications. Times they are a'changin' !

Friday, October 22, 2010

ICML topic trends

David Mimno fit a Dirichlet-multinomial to ICML papers 2004-2008. Seems like "real world" problems are going strong, while boosting took a serious dive

Wednesday, October 13, 2010

Why do we need integrals in Computer Science?

A comment on previous post asked why we need integrals for computer science. One reason is that combinatorial expressions often have representation in terms of integrals. Consider binomial coefficients. We have the following

$${n \choose \frac{n+d}{2}}=\frac{1}{2 \pi} \int_{-\pi}^\pi e^{-\mathbb{i} q d} (2\cos q)^n dq$$

To see where this comes from, consider that $2 \cos x=e^{-\mathbb{i}\pi}+e^{\mathbb{i}\pi}$ is the generating function for number of one step walks starting at 0. Coefficients give number of those walks ending at 1 and -1. Raise it to $n$ and apply inverse Fourier transform to get number of $n$ step walks ending at at $d$, which has an alternative representation in terms of binomial coefficient.

Here's what the result looks like for arbitrary $d$ with $n=10$

Another way to get binomial coefficients is to note that they form coefficients of power of $x$ in expansion of $(1+x)^n$. Use Cauchy's integral formula to recover the coefficient.

We get the following

$${n \choose k} = \frac{1}{2\pi \mathbb{i}}\oint_c \frac{(1+x)^n}{x^{k+1}}$$

Contour integral is counter-clockwise around some contour that encircles the origin.
For instance, you can use the following contour.

Mathematica notebook

Sunday, October 03, 2010

Theoretical CS cheat sheet

Thanks to John Cook for pointing it out

Saturday, October 02, 2010

Universal Laws and Computational Irreducibility

  • Terry Tao's article on universality.
  • Stephen Wolfram's speech on future special functions.

They give opposing perspective -- in the end of speech Stephen Wolfram argues that most processes in universe are "computationally irreducible" and we have no hope to simulating them accurately, whereas Terry Tao's article gives examples of many things in real world obeying a small set of simple laws