Machine Learning, etc: November 2011

Monday, November 21, 2011

Interesting papers coming up at NIPS'11

A number of accepted papers already have camera-ready versions posted. Here are the ones I found interesting. I'll give a further update on these after the conference.

Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials, P. Krähenbühl, V. Koltun
Fast and Accurate k-means For Large Datasets, M. Shindler, A. Wong, A. Meyerson
Hashing Algorithms for Large-Scale Learning, P. Li, A. Shrivastava, J. Moore, A. König
Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, B. Recht, C. Ré, S. Wright, F. Niu
Generalizing from Several Related Classification Tasks to a New Unlabeled Sample, G. Blanchard, G. Lee, C. Scott
How biased are maximum entropy models? J. Macke, I. Murray, P. Latham
On Tracking The Partition Function, G. Desjardins, A. Courville, Y. Bengio
Selecting Receptive Fields in Deep Networks, A. Coates, A. Ng
Shallow vs. Deep Sum-Product Networks, O. Delalleau, Y. Bengio
Statistical Tests for Optimization Efficiency, L. Boyles, A. Korattikara, D. Ramanan, M. Welling

Additionally, here's a non-NIPS preprint I came across that summarizes belief propagation and related algorithms. In particular, it turns out that Divide and Concur, the state-of-the-art algorithm for sphere packing, is closely related to belief propagation.

Message-passing Algorithms for Inference and Optimization, Jonathan S. Yedidia

Sunday, November 13, 2011

Shapecatcher

Here's a cool tool I stumbled across while reading John Cook's blog: Shapecatcher looks up the Unicode value of a character from a drawing of it.

Apparently it uses Shape Context features.
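For readers who haven't met shape contexts: the descriptor for one point on a shape is a log-polar histogram of where the other sampled points lie relative to it. Here is a minimal sketch of that idea. The function name `shape_context` and the defaults (5 radial bins, 12 angular bins, radii from 0.125 to 2 after normalizing by the mean pairwise distance) are my assumptions for illustration, and I don't know which parameters or matching procedure Shapecatcher itself uses.

```python
import math

def shape_context(points, index, n_r=5, n_theta=12, r_min=0.125, r_max=2.0):
    """Log-polar histogram of the other points as seen from points[index].

    Hypothetical helper for illustration; returns hist[r_bin][theta_bin].
    """
    n = len(points)
    # Normalize by the mean pairwise distance so the descriptor is scale invariant.
    pair_dists = [math.hypot(points[a][0] - points[b][0], points[a][1] - points[b][1])
                  for a in range(n) for b in range(a + 1, n)]
    mean_dist = sum(pair_dists) / len(pair_dists)
    # Upper edges of the radial bins, spaced geometrically from r_min to r_max.
    ratio = (r_max / r_min) ** (1.0 / (n_r - 1))
    r_edges = [r_min * ratio ** t for t in range(n_r)]
    hist = [[0] * n_theta for _ in range(n_r)]
    px, py = points[index]
    for i, (x, y) in enumerate(points):
        if i == index:
            continue
        r = math.hypot(x - px, y - py) / mean_dist
        if r > r_max:
            continue  # too far away to be counted
        # First radial bin whose upper edge covers r (last bin as a float-safety fallback).
        r_bin = next((b for b, edge in enumerate(r_edges) if r <= edge), n_r - 1)
        theta = math.atan2(y - py, x - px) % (2 * math.pi)
        t_bin = min(n_theta - 1, int(theta / (2 * math.pi / n_theta)))
        hist[r_bin][t_bin] += 1
    return hist

# Four points at equal distance around a center point.
pts = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1)]
hist = shape_context(pts, 0)
print(sum(hist[3]))  # 4: all neighbours fall in the same radial bin
```

Two shapes can then be compared by matching points whose histograms are similar.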

This motivated me to put together another dataset. Unlike notMNIST, it focuses on the tail end of Unicode: 370k bitmaps representing 29k Unicode values, grouped by Unicode value.

Unicode 370k

Wednesday, November 09, 2011

Google1000 dataset

This is a dataset of scans of 1000 public domain books that was released to the public at ICDAR 2007. At the time there was no public serving infrastructure, so few people actually got the 120GB dataset. It has since been hosted on Google Cloud Storage and made available for public download.

http://commondatastorage.googleapis.com/books/icdar2007/README.txt
http://commondatastorage.googleapis.com/books/icdar2007/[filename]
[filename] goes from Volume_0000.zip to Volume_0999.zip

To download it, e.g. with Python on a Linux box:

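import os

base_url = 'http://commondatastorage.googleapis.com/books/icdar2007/'
for i in range(1000):
    # Volumes are named Volume_0000.zip through Volume_0999.zip
    fname = 'Volume_%04d.zip' % i
    # -L follows redirects, -o saves under the same name locally
    os.system('curl -L %s%s -o %s' % (base_url, fname, fname))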

The goal of this dataset is to facilitate research into image post-processing and accurate OCR for scanned books.

Sunday, November 06, 2011

b-matching as improvement of kNN

Below is an illustration of b-matching from the (Huang, Jebara, AISTATS 2007) paper. You start with a weighted graph, and the goal is to connect each v to k u's so as to minimize the total edge cost. If the v's represent labelled datapoints, the u's unlabeled ones, and the weights correspond to distances, this works as a robust version of a kNN classifier (k=2 in the picture) because it prevents any datapoint from exerting too much influence.

They show that this restriction significantly improves robustness to changes in distribution between the training and test sets. See Figure 7 in that paper for an example with MNIST digits. This is just one of a series of intriguing papers on matchings that came out of Tony Jebara's lab; there's a nice overview on his page that ties them together.
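To make the degree-capping idea concrete, here is a small sketch of b-matching used as a classifier. It rests on several assumptions of mine: every unlabeled point gets exactly `k` labelled neighbours, every labelled point may be used at most `cap` times, and a generic min-cost-flow routine stands in for whatever solver the paper uses. The names `min_cost_flow`, `b_matching_classify` and `cap` are made up for this illustration.

```python
from collections import Counter

def min_cost_flow(n, edges, source, sink, need):
    """Push `need` units from source to sink at minimum total cost.

    Successive shortest paths with Bellman-Ford; edges are (u, v, capacity, cost).
    """
    graph = [[] for _ in range(n)]  # entries: [to, residual_cap, cost, reverse_index]
    for u, v, cap, cost in edges:
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])
    for _unit in range(need):
        dist = [float("inf")] * n
        prev = [None] * n
        dist[source] = 0
        for _pass in range(n - 1):
            changed = False
            for u in range(n):
                if dist[u] == float("inf"):
                    continue
                for ei, (v, cap, cost, _rev) in enumerate(graph[u]):
                    # Small tolerance guards against float round-off loops.
                    if cap > 0 and dist[u] + cost < dist[v] - 1e-12:
                        dist[v], prev[v], changed = dist[u] + cost, (u, ei), True
            if not changed:
                break
        if prev[sink] is None:
            raise ValueError("degree constraints cannot be satisfied")
        v = sink
        while v != source:  # send one unit along the cheapest path
            u, ei = prev[v]
            graph[u][ei][1] -= 1
            graph[v][graph[u][ei][3]][1] += 1
            v = u
    return graph

def b_matching_classify(dist, labels, k, cap):
    """dist[i][j] is the distance from unlabeled point i to labelled point j."""
    n_u, n_v = len(dist), len(labels)
    source, sink = n_u + n_v, n_u + n_v + 1
    edges = []
    for i in range(n_u):
        edges.append((source, i, k, 0))            # each unlabeled point needs k links
        for j in range(n_v):
            edges.append((i, n_u + j, 1, dist[i][j]))
    for j in range(n_v):
        edges.append((n_u + j, sink, cap, 0))      # each labelled point used <= cap times
    graph = min_cost_flow(n_u + n_v + 2, edges, source, sink, n_u * k)
    preds = []
    for i in range(n_u):
        # Saturated edges into labelled nodes are the matched neighbours.
        chosen = [to - n_u for to, c, _cost, _rev in graph[i]
                  if n_u <= to < n_u + n_v and c == 0]
        preds.append(Counter(labels[j] for j in chosen).most_common(1)[0][0])
    return preds

# Labelled point 0 ("a") is the nearest neighbour of every unlabeled point.
dist = [[1, 5, 6],
        [2, 3, 7],
        [2, 6, 4]]
labels = ["a", "b", "b"]
print(b_matching_classify(dist, labels, k=1, cap=3))  # ['a', 'a', 'a']
print(b_matching_classify(dist, labels, k=1, cap=1))  # ['a', 'b', 'b']
```

With `cap` at least as large as the number of unlabeled points the constraint never binds and this reduces to plain kNN, so the single "hub" point decides every prediction. Tightening `cap` to 1 forces the other labelled points to take their share.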