Machine Learning, etc: May 2015

Some ICLR posters that caught my eye:

A very simple idea to implement that gives impressive results. They force two groups of units to be uncorrelated by penalizing their cross-covariance. When the first group is also forced to model classes, the second group automatically models the "style". The problem of separating out "style" has been studied for a while (see Tenenbaum's "Separating style and content with bilinear models" and the papers citing it), but this implementation is the simplest I've seen and could be done as a small modification of batch SGD.
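
A minimal sketch of such a penalty, assuming it is half the sum of squared entries of the batch cross-covariance matrix. The function name, the 0.5 factor and the toy data are illustrative choices, not taken from the poster.

```python
import numpy as np

def cross_covariance_penalty(y, z):
    """Half the sum of squared entries of the batch cross-covariance of y and z.

    y: (batch, d_y) activations of the first group (e.g. the class units).
    z: (batch, d_z) activations of the second group (e.g. the "style" units).
    """
    y_centered = y - y.mean(axis=0, keepdims=True)
    z_centered = z - z.mean(axis=0, keepdims=True)
    # cross_cov[i, j] = batch average of (y_i - mean(y_i)) * (z_j - mean(z_j))
    cross_cov = y_centered.T @ z_centered / y.shape[0]
    return 0.5 * np.sum(cross_cov ** 2)

rng = np.random.default_rng(0)
y = rng.normal(size=(256, 10))
z_independent = rng.normal(size=(256, 4))   # unrelated to y
z_copy = y[:, :4].copy()                    # deliberately correlated with y

penalty_independent = cross_covariance_penalty(y, z_independent)
penalty_copy = cross_covariance_penalty(y, z_copy)
```

In training, a term like this would be added to the usual classification loss, so the only change to batch SGD is one extra term in the objective.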

Similar to Yann LeCun's "Siamese Network", but potentially makes the learning problem easier by looking at two pairs of examples at each step: two examples of the same class plus two examples of different classes. The loss models these pair distances: the first distance must be smaller than the second.
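
One way to write down "the first distance must be smaller" is a hinge on the difference of the two pair distances. The hinge form, the squared Euclidean distance and the `margin` parameter are assumptions made for this sketch; the poster's exact loss may differ.

```python
import numpy as np

def two_pair_hinge_loss(same_a, same_b, diff_a, diff_b, margin=1.0):
    """Penalize cases where the same-class pair is not closer than the
    different-class pair by at least `margin` (hypothetical parameter).

    All inputs are embeddings of shape (batch, dim).
    """
    d_same = np.sum((same_a - same_b) ** 2, axis=-1)   # same-class pair distance
    d_diff = np.sum((diff_a - diff_b) ** 2, axis=-1)   # different-class pair distance
    return np.maximum(0.0, margin + d_same - d_diff)

origin = np.array([[0.0, 0.0]])
near = np.array([[0.1, 0.0]])
far = np.array([[3.0, 0.0]])

loss_good = two_pair_hinge_loss(origin, near, origin, far)  # same pair close, diff pair far
loss_bad = two_pair_hinge_loss(origin, far, origin, near)   # the reverse situation
```

The loss is zero once the embedding already orders the two distances correctly, and grows with the violation otherwise.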

Apparently Adobe has collected photographs of text manually tagged with font identity, which the authors promise to release at http://www.atlaswang.com/deepfont.html . Biased here because I've looked at fonts before (e.g., see the notMNIST dataset).

Similar to James Martens' "Optimizing Neural Networks with Kronecker-factored Approximate Curvature". Both papers make use of the fact that if you want to whiten the gradient in some layer, it's equivalent (but cheaper) to whiten the activations and the backprop values instead. Martens inverts the activation covariance matrix directly, while Povey has an iterative scheme to approximate it with something easy to invert.
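
The "equivalent but cheaper" claim can be checked numerically for a single linear layer, assuming the curvature matrix is the Kronecker product of the activation and backprop second-moment matrices. This is a sketch of the underlying identity only, not of either paper's algorithm; it uses the plain inverse rather than an inverse square root, and the damping constant is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 4, 3, 50

a = rng.normal(size=(batch, n_in))     # layer inputs (activations)
g = rng.normal(size=(batch, n_out))    # backpropagated values at the layer output

grad_w = g.T @ a / batch                        # (n_out, n_in) weight gradient
A = a.T @ a / batch + 1e-3 * np.eye(n_in)       # activation second moment, damped
G = g.T @ g / batch + 1e-3 * np.eye(n_out)      # backprop second moment, damped

# Expensive route: build the (n_in*n_out) x (n_in*n_out) Kronecker matrix and solve.
vec_grad = grad_w.flatten(order="F")            # column-stacking vec()
big = np.kron(A, G)
whitened_big = np.linalg.solve(big, vec_grad).reshape((n_out, n_in), order="F")

# Cheap route: one small solve per factor, never forming the big matrix.
whitened_small = np.linalg.solve(G, grad_w) @ np.linalg.inv(A)

same_result = np.allclose(whitened_big, whitened_small)
```

The big matrix has `(n_in * n_out)**2` entries, while the two factors have only `n_in**2 + n_out**2`, which is where the savings come from.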

With Async SGD, workers end up with stale values that deviate from what the rest of the workers see. Here they derive some guidance on how far each worker is allowed to deviate.

They get better results by doing SGD where examples are sampled using a non-uniform strategy, such as sampling in proportion to gradient magnitude. This is empirical confirmation of results from the Stochastic Gradient Methods workshop last year, where people showed that optimal SGD sampling strategies are related to "take high gradient magnitude steps first".
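
A toy sketch of sampling in proportion to gradient magnitude, using a linear least-squares model. The `1 / (n * p_i)` reweighting is standard importance sampling added here to keep the estimate unbiased; it is an assumption of this sketch, not something stated on the poster.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = np.zeros(d)

# Per-example gradients of 0.5 * (x . w - y)^2 for a linear model.
residual = X @ w - y
per_example_grad = residual[:, None] * X          # (n, d)
full_grad = per_example_grad.mean(axis=0)

# Sampling probabilities proportional to per-example gradient norm.
norms = np.linalg.norm(per_example_grad, axis=1)
p = norms / norms.sum()

# Draw a minibatch and reweight each example by 1 / (n * p_i).
idx = rng.choice(n, size=32, p=p)
weights = 1.0 / (n * p[idx])
estimate = (per_example_grad[idx] * weights[:, None]).mean(axis=0)

# Exact expectation of one reweighted sample: it recovers the full gradient.
expected = (p[:, None] * per_example_grad / (n * p[:, None])).sum(axis=0)
```

Computing every per-example gradient up front defeats the purpose in practice, so a real implementation would need a cheap proxy for the gradient norms; here it just keeps the illustration short.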

These are essentially separable ConvNets: instead of learning regular filters and then trying to approximate them with a low-rank expansion, they learn rank-1 filters right away. Such a decrease in parameters while keeping the same performance and architecture is dramatic and surprising.
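
To see why rank-1 filters are cheap, here is a small check that filtering with a rank-1 `k x k` kernel equals a vertical 1D pass followed by a horizontal 1D pass. Only the two spatial dimensions are shown; how the poster treats the channel dimension is not covered, and the helper `correlate2d_valid` is written just for this sketch.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Plain 2D cross-correlation with no padding (what ConvNets call convolution)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(12, 12))
k = 5
vertical = rng.normal(size=(k, 1))      # k parameters
horizontal = rng.normal(size=(1, k))    # k parameters

full_filter = vertical @ horizontal     # k*k entries, but only rank 1
one_pass = correlate2d_valid(image, full_filter)
two_pass = correlate2d_valid(correlate2d_valid(image, vertical), horizontal)

params_full = full_filter.size                      # k * k
params_separable = vertical.size + horizontal.size  # 2 * k
```

For `k = 5` that is 10 parameters instead of 25 per filter, and the gap widens as the filter grows.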

The most intriguing paper of the conference IMHO. It's an alternative to backprop for training networks. The basic idea is this -- your neural network is a function

$$f_\text{dataset}(\text{parameters})=\text{labels}$$

In an ideal world, you could learn your parameters by inverting $f$, so your estimate would be
$$\text{newparams}=f^{-1}(\text{labels})$$

That's too hard, so you approximate $f$ with a quadratic function $g$ that has unit Hessian, invert that instead, and let

$$\text{newparams}=g^{-1}(\text{labels})$$

Since the original approximation was bad, the new estimate will be off, so you repeat the step above. This gives you standard gradient descent.
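
One way to make this step concrete, assuming the quadratic model is taken of a scalar training loss $L$ around the current parameters $\theta_t$ (both symbols are introduced here for illustration and are not from the poster):

$$g(\theta)=L(\theta_t)+\nabla L(\theta_t)^\top(\theta-\theta_t)+\tfrac{1}{2}\|\theta-\theta_t\|^2$$

Setting the gradient of this model to zero gives

$$\nabla g(\theta)=\nabla L(\theta_t)+(\theta-\theta_t)=0\;\Longrightarrow\;\theta_{t+1}=\theta_t-\nabla L(\theta_t)$$

which is a gradient descent step with step size 1. Using a Hessian of $\tfrac{1}{\eta}I$ instead of $I$ gives the usual update with learning rate $\eta$.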

An alternative approach is to try to learn $g^{-1}$ from data directly. In particular, notice that in a standard autoencoder trained to do $f(g(x))=x$, $g$ can be viewed as an inverse of $f$. So in this paper, they treat the neural network as a composition of functions corresponding to layers and try to invert each one using the approach above.

Biased because I've played with Theano, but apparently someone got Theano running well enough to train ImageNet at a speed similar to Caffe (30% slower on 1 GPU, according to some beer conversations).

If you have 10k classes, you need to take the dot product of your last-layer activations with 10k rows of the last parameter matrix. Instead, you could use WTA hashing to find a few rows likely to produce a large dot product with your activation vector, and only look at those.
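
A rough sketch of the idea, assuming the usual winner-take-all hash (for each random permutation, record which of the first few permuted coordinates is largest). The sizes, the planted row and the helper name `wta_codes` are all illustrative choices, not details from the poster.

```python
import numpy as np

def wta_codes(vectors, permutations, window):
    """Winner-take-all hash codes.

    vectors: (n, d); permutations: (n_hashes, d) index arrays.
    Returns (n, n_hashes): for each permutation, the position of the largest
    value among the first `window` permuted coordinates.
    """
    permuted = vectors[:, permutations[:, :window]]   # (n, n_hashes, window)
    return permuted.argmax(axis=-1)

rng = np.random.default_rng(0)
n_classes, dim, n_hashes, window, n_candidates = 200, 32, 64, 4, 10

W = rng.normal(size=(n_classes, dim))    # last-layer parameter matrix
x = rng.normal(size=dim)                 # last-layer activations
W[17] = 2.0 * x                          # plant one row with a large dot product

permutations = np.array([rng.permutation(dim) for _ in range(n_hashes)])
row_codes = wta_codes(W, permutations, window)              # can be precomputed
x_codes = wta_codes(x[None, :], permutations, window)[0]

# Rank rows by how many hash codes they share with the activation vector.
matches = (row_codes == x_codes).sum(axis=1)
candidates = np.argsort(-matches)[:n_candidates]

# Only the candidate rows get an exact dot product.
scores = W[candidates] @ x
best = candidates[scores.argmax()]
```

Comparing codes against every row, as done here for brevity, is not actually cheaper than the full matrix product; a real implementation would bucket rows by code in hash tables so that lookup cost does not grow with the number of classes.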

Nvidia demoing their 4-GPU rig.

They use some off-the-shelf software to decompose the 4D convolutional tensor (width, height, input depth, output depth) into a tensor product that is faster to compute. However, if flattened convolutional networks work for other domains, this seems like overkill.

They took a network trained on ImageNet and sent the image regions that activated particular filters to labelers, to try to figure out what various neurons respond to. A surprisingly high number of neurons learned semantically meaningful concepts.

This was further reinforced by a demo by Jason Yosinski showing visualizations of various filters in an ImageNet network. You could see the emergence of "Face" and "Wrinkle" detectors, even though no such class is in ImageNet.


Tuesday, May 12, 2015

ICLR 2015