Thursday, September 08, 2011

notMNIST dataset

I've taken some publicly available fonts and extracted glyphs from them to make a dataset similar to MNIST. There are 10 classes, with letters A-J taken from different fonts. Here are some examples of the letter "A".

Judging by the examples, one would expect this to be a harder task than MNIST. This seems to be the case: logistic regression on top of a stacked auto-encoder with fine-tuning gets about 89% accuracy, whereas the same approach got 98% on MNIST.

The dataset consists of a small hand-cleaned part, about 19k instances, and a large uncleaned part, 500k instances. The two parts have approximately 0.5% and 6.5% label error rates, respectively. I got these numbers by looking through glyphs and counting how often my guess of the letter didn't match its Unicode value in the font file.

The Matlab version of the dataset (.mat file) can be accessed as follows:
for i=1:5
The zipped version is just a set of PNG images grouped by class. You can turn the zipped version of the dataset into the Matlab version as follows:
tar -xzf notMNIST_large.tar.gz
python notMNIST_large notMNIST_large.mat
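For readers not using Matlab, here is a minimal Python sketch of how the unpacked archive could be indexed. It assumes the archive unpacks into one folder per class named A through J; that layout and the helper name `index_dataset` are assumptions for illustration, not something the post specifies.

```python
import os

CLASSES = "ABCDEFGHIJ"  # the ten letter classes described in the post

def index_dataset(root):
    """Return (png_path, label) pairs ordered by class, then file name.

    Assumes root contains one sub-folder per letter (an assumption about
    the archive layout). Labels are 0..9 for A..J.
    """
    samples = []
    for label, letter in enumerate(CLASSES):
        class_dir = os.path.join(root, letter)
        if not os.path.isdir(class_dir):
            continue  # tolerate a missing class folder
        for name in sorted(os.listdir(class_dir)):
            if name.lower().endswith(".png"):
                samples.append((os.path.join(class_dir, name), label))
    return samples
```

Calling `index_dataset("notMNIST_large")` after unpacking would then give a list that an image library of your choice can read file by file.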
Approaching 0.5% error rate on notMNIST_small would be very impressive. If you run your algorithm on this dataset, please let me know your results.


Anonymous said...

What does your baseline get on the negated version of the dataset? In other words, make the "ink" pixels have intensity 1 and the non-ink pixels have intensity zero. I would be curious to know if your baseline does better on one version or the other.
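The negation described in this comment can be sketched as follows, assuming pixel intensities are already scaled to the range [0, 1] (for 8-bit images the equivalent would be 255 minus the pixel value). The function name is made up for illustration.

```python
def negate(image):
    """Flip intensities of an image given as rows of floats in [0, 1]."""
    return [[1.0 - pixel for pixel in row] for row in image]

# A tiny made-up 2x2 "glyph" to show the effect.
glyph = [[0.0, 1.0],
         [0.25, 0.0]]
negated = negate(glyph)
```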

Yaroslav Bulatov said...

I don't expect it to make any difference -- pixel-level features are learned by the stacked autoencoder, and there's nothing biasing the learner to prefer 0's or 1's to start with.

Anonymous said...

It makes a difference on MNIST, which is why I asked.

Anonymous said...

Can we use this data if we do research on it? For instance, if we obtain relatively good results or propose something novel, are we allowed to publish anything on it?

Yaroslav Bulatov said...

ML_random_guy -- that depends on whether your country has laws against publishing

Anonymous said...

Hello, it's nice to have such a new challenging dataset. Do you recommend a specific evaluation protocol (number of training/test images) ? Otherwise people will work on different subsets and results will not be directly comparable.

Yaroslav Bulatov said...

Train on the whole "dirty" dataset, evaluate on the whole "clean" dataset.

This is a better indicator of the real-life performance of a system than a traditional 60/30 split, because there is often a ton of low-quality ground truth and only a small amount of high-quality ground truth. For this task, I can get millions, possibly billions, of distinct digital glyph images with 5-10% of labels wrong, but I'm stuck with a small amount of near-perfectly labeled glyphs.
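As a rough illustration of this protocol, the sketch below fits a toy nearest-centroid classifier on a noisy training set and reports accuracy only on a clean test set. The classifier, the 2-D features and all names here are stand-ins invented for the example; they are not the models or data discussed in this thread.

```python
def fit_centroids(samples):
    """samples: list of (feature_vector, label). Returns {label: mean vector}."""
    sums, counts = {}, {}
    for x, y in samples:
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(centroids, key=lambda y: dist(centroids[y]))

def accuracy(centroids, samples):
    """Fraction of samples whose predicted label matches the given label."""
    correct = sum(1 for x, y in samples if predict(centroids, x) == y)
    return correct / len(samples)

# Toy stand-ins for the two parts of the dataset (made-up 2-D features).
noisy_train = [([0.0, 0.1], "A"), ([0.1, 0.0], "A"), ([0.2, 0.1], "A"),
               ([1.0, 0.9], "B"), ([0.9, 1.0], "B"), ([0.8, 0.9], "B"),
               ([0.9, 0.9], "A")]  # one deliberately wrong label
clean_test = [([0.05, 0.05], "A"), ([0.95, 0.95], "B")]

model = fit_centroids(noisy_train)         # train on the whole "dirty" set
clean_accuracy = accuracy(model, clean_test)  # evaluate on the "clean" set
```

The point of the design is that the noisy labels only ever influence training; the reported number depends solely on the carefully labeled examples.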

Anonymous said...

Thanks for the protocol info. Would it be possible to get a tar archive with PNG images of the small dataset like the huge one ? I'm not using Matlab.

Yaroslav Bulatov said...

Oops, small tar should've been in the directory to start with, fixed

osdf said...

I used this dataset to test some of my code and got about 3.8% error rate. Are there more results known for this dataset? A few lines of text are here.

Yaroslav Bulatov said...

Hey, that's pretty impressive! This is the highest accuracy I know. I'm working on a larger dataset to release publicly, but slowed down by some legal clearance hurdles

Yaroslav Bulatov said...

How do you do finetuning? Hinton's contrastive wake-sleep?

goodfellow.ian said...

What is unicode370k.tar.gz?

Yaroslav Bulatov said...

It's a bunch of characters taken from the tail end of unicode values

Nicholas Leonard said...

How did you split your dataset into train,valid,test to get 89%?

Nicholas Leonard said...

Hi, Zhen Zhou and I, from the LISA lab at Université de Montréal, trained a couple of 4-layer MLPs with 1024-300-50 hidden neurons respectively. We divided the noisy set into 5/6 train and 1/6 valid, and kept the clean set for testing. We got 97.1% accuracy on the test set at epoch 412 with early stopping, linear decay of the learning rate, a hard constraint on the norm of the weights and tanh activation units. We get approximately 93% on valid and 98% on train. The train set is easy to overfit (you can get 100% accuracy on train if you continue training). One could probably do better by pursuing hyper-parameter optimization further. We used Torch 7.
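Three of the ingredients mentioned in this comment (the 5/6 to 1/6 split, linear learning-rate decay and a hard norm constraint) can be sketched in plain Python as below. This is not the commenters' Torch 7 code; the function names, the seed and any numeric settings passed in are hypothetical, and the norm constraint is shown on a single weight vector as an assumption about how it was applied.

```python
import math
import random

def split_train_valid(items, valid_fraction=1 / 6, seed=0):
    """Shuffle items and hold out valid_fraction of them for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(round(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]

def linear_decay(initial_lr, final_lr, epoch, decay_epochs):
    """Interpolate linearly from initial_lr to final_lr, then hold final_lr."""
    fraction = min(epoch / decay_epochs, 1.0)
    return initial_lr + fraction * (final_lr - initial_lr)

def clip_to_max_norm(weights, max_norm):
    """Rescale a weight vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_norm:
        return list(weights)
    scale = max_norm / norm
    return [w * scale for w in weights]
```

In a training loop, `clip_to_max_norm` would typically be applied after each gradient update, and `linear_decay` queried once per epoch.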

Georg Friedrich said...

I used a simple neural network (784, 1024, 10), where the activation function was ReLU followed by a plain softmax. Without activation decay, early stopping, dropout and so on, and with 3001 iterations and a batch size of 128, I got 89.3% accuracy on the test set.

Step: 3000
Minibatch accuracy: 86.7%
Validation accuracy: 82.6%

Finish (after Step 3001):
Test accuracy: 89.3%

Pavlos Mitsoulis - Ntompos said...

Minibatch loss at step 3000: 55.872269
Minibatch accuracy: 79.7%
Validation accuracy: 84.4%
Test accuracy: 90.6%

With a neural network with a single hidden layer (1024 nodes), Relu and l2 regularization.

Phạm T. Lâm said...

Minibatch loss at step 10000: 123.963661
Minibatch accuracy: 45.3%
Validation accuracy: 85.3%
Test accuracy: 91.4%

With dropout, ReLU and an L2 regularizer; a single hidden layer with 1024 nodes.

Alec Karfonta said...

Yaroslav Bulatov,

Thank you for the fun and challenging dataset.

How were the names of the files chosen?

I'm working on renaming each one to the phash value of the image. It looks like the names might already be the result of a hash.

Alec Karfonta said...

Check out the Udacity course in deep learning, made by Google. They use this dataset extensively and show some really powerful techniques. The goal of the last assignment was to experiment with these techniques to find the best accuracy using a regular multi-layer perceptron. I have a pretty beefy machine: 6600K OC, 2x GTX 970 OC, 16 GB DDR4, Samsung 950 Pro; so I set up a decent-sized network and let it train for a while.

My best network gets:

Test accuracy: 97.4%
Validation accuracy: 91.9%
Minibatch accuracy: 97.9%

First I applied a Phash to every image and removed any with direct collisions. Then I split the large folder into ~320k training and ~80k validation. I used ~17k in the small folder for testing. Trained on mini-batches using SGD on the cross-entropy, dropout between each layer and an exponentially decaying learning rate. The network has three hidden layers with RELU units, plus a standard softmax output layer.
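The de-duplication step can be approximated with the standard library alone. Note the swap: the sketch below hashes the raw file bytes with SHA-1, so it only catches byte-identical files, whereas a perceptual hash as described above would also catch visually identical images stored differently. The function name is made up for illustration.

```python
import hashlib

def drop_exact_duplicates(images):
    """images: iterable of (name, raw_bytes).

    Keeps the first name seen for each distinct content hash and returns
    the kept names in their original order.
    """
    seen = set()
    kept = []
    for name, data in images:
        digest = hashlib.sha1(data).hexdigest()
        if digest in seen:
            continue  # same bytes as an earlier image: skip it
        seen.add(digest)
        kept.append(name)
    return kept
```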

Here are the parameters:
Mini-batch size: 1024
Hidden layer 1 size: 4096
Hidden layer 2 size: 2048
Hidden layer 3 size: 1024
Initial learning rate: 0.1
Dropout probability: 0.5
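For reference, an exponentially decaying learning rate is commonly written as the initial rate multiplied by a decay rate raised to the fraction of decay steps completed. The sketch below shows that formula; only the initial rate of 0.1 comes from the parameters above, while `decay_steps` and `decay_rate` are hypothetical knobs whose actual values are not given in this comment.

```python
def exponential_decay(initial_lr, step, decay_steps, decay_rate):
    """Learning rate after `step` updates:
    initial_lr * decay_rate ** (step / decay_steps)."""
    return initial_lr * decay_rate ** (step / decay_steps)

# Initial rate from the list above; the other two values are made up.
lr_at_start = exponential_decay(0.1, 0, decay_steps=10000, decay_rate=0.5)
lr_later = exponential_decay(0.1, 10000, decay_steps=10000, decay_rate=0.5)
```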

I ran this for 150k iterations, which took an hour and a half using one GPU. Learning pretty much stopped at 60k, but the model never began to overfit. I believe that is because of the dropout and because the dataset is so large. Even at the end of all that training with a good-sized network, the mini-batch accuracy still did not reach 100%, so learning could continue, albeit slowly.

The next assignment is to use a convolutional network, which looks promising. I'll try to post those results too.

Anonymous said...

Could you make your code available? Or at least say which parameters you used for the exponentially decaying learning rate? Did you use L2 regularization (if yes, with which regularization factor)? I tried to use the same network as you did and it simply doesn't converge.

Gabriel said...

Test accuracy: 98.09%

With a CNN layout as follows:

3 x convolutional (3x3)
max pooling (2x2)
dropout (0.25)

3 x convolutional (3x3)
max pooling (2x2)
dropout (0.25)

dense (4*N)
dropout (0.5)
dense (2*N)
dropout (0.5)
dense (N)
dropout (0.5)
softmax (10)

N is the number of pixels in the images. All layers use ReLU activation. I also used some zero padding before each convolutional layer. The network was trained with Adadelta. It took ~45 iterations with early stopping at patience 10. As a final step I ran SGD with the same early stopping and a decaying learning rate starting at 0.1. It ran for about 15 iterations. Evaluating the network on the training set, the accuracy was 99.07%, and it was 94.25% on the validation set.
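To make the layout above concrete, here is a small sketch that walks through the layer shapes without any deep learning library. It assumes 28x28 single-channel images (consistent with the 784 inputs mentioned earlier in the thread), stride 1, and one pixel of zero padding so each 3x3 convolution keeps the spatial size. The filter counts are not given in the comment, so they are hypothetical arguments.

```python
def trace_shapes(height, width, filters_block1, filters_block2):
    """List (layer_name, output_shape) for the described layout.

    Dropout layers are omitted because they do not change shapes.
    """
    n_pixels = height * width
    shapes = []
    h, w, c = height, width, 1
    for filters in (filters_block1, filters_block2):
        for _ in range(3):
            c = filters  # padded 3x3 conv keeps h and w, changes channels
            shapes.append(("conv3x3", (h, w, c)))
        h, w = h // 2, w // 2  # 2x2 max pooling halves each spatial dimension
        shapes.append(("maxpool2x2", (h, w, c)))
    shapes.append(("flatten", (h * w * c,)))
    for factor in (4, 2, 1):
        shapes.append(("dense", (factor * n_pixels,)))
    shapes.append(("softmax", (10,)))
    return shapes

# Hypothetical filter counts, chosen only for illustration.
shapes = trace_shapes(28, 28, filters_block1=32, filters_block2=64)
```

With N = 784 the three dense layers have 3136, 1568 and 784 units, which shows how large the fully connected part of this network is relative to the convolutional part.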

杨健程 said...

Minibatch loss at step 4999: 0.901939
Minibatch accuracy: 75.0%
Validation accuracy: 87.3%
Test accuracy: 93.3% @step=4999
Model saved in file: save/myconvnet_5000

I used an architecture similar to LeNet, and it seems to get better as the number of steps gets larger.