Thursday, September 08, 2011

notMNIST dataset

I've taken some publicly available fonts and extracted glyphs from them to make a dataset similar to MNIST. There are 10 classes, with letters A-J taken from different fonts. Here are some examples of the letter "A".

Judging by the examples, one would expect this to be a harder task than MNIST. This seems to be the case -- logistic regression on top of a stacked auto-encoder with fine-tuning gets about 89% accuracy, whereas the same approach gets about 98% on MNIST.

The dataset consists of a small hand-cleaned part of about 19k instances and a large uncleaned part of about 500k instances. The two parts have approximately 0.5% and 6.5% label error rates respectively. I estimated this by looking through glyphs and counting how often my guess of the letter didn't match its Unicode value in the font file.

The Matlab version of the dataset (.mat file) can be accessed as follows:
load('notMNIST_small.mat')
for i=1:5
    figure('Name',num2str(labels(i))),imshow(images(:,:,i)/255)
end
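For those not using Matlab, a rough Python equivalent of the snippet above might look like this (a sketch assuming scipy and matplotlib are installed; the images and labels variable names come from the .mat file as used above):

import scipy.io
import matplotlib.pyplot as plt

# Load the same .mat file and display the first five glyphs with their labels.
data = scipy.io.loadmat('notMNIST_small.mat')
images, labels = data['images'], data['labels'].ravel()

for i in range(5):
    plt.figure(str(labels[i]))
    plt.imshow(images[:, :, i] / 255.0, cmap='gray')
plt.show()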
The zipped version is just a set of PNG images grouped by class. You can turn the zipped version of the dataset into the Matlab version as follows:
tar -xzf notMNIST_large.tar.gz
python matlab_convert.py notMNIST_large notMNIST_large.mat
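If you prefer to skip the .mat conversion, here is a rough sketch of loading the unpacked PNG folders directly (assuming the A-J per-class subdirectory layout and the numpy and Pillow packages; load_folder is a hypothetical helper, not part of the dataset tools):

import os
import numpy as np
from PIL import Image

def load_folder(root):
    # Read every PNG under root/A ... root/J into a (N, 28, 28) array.
    images, labels = [], []
    for label, letter in enumerate('ABCDEFGHIJ'):
        folder = os.path.join(root, letter)
        for name in sorted(os.listdir(folder)):
            try:
                img = Image.open(os.path.join(folder, name))
                images.append(np.asarray(img, dtype=np.float32) / 255.0)
                labels.append(label)
            except IOError:
                pass  # skip any files that fail to open
    return np.stack(images), np.array(labels)

images, labels = load_folder('notMNIST_large')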
Approaching 0.5% error rate on notMNIST_small would be very impressive. If you run your algorithm on this dataset, please let me know your results.

16 comments:

Anonymous said...

What does your baseline get on the negated version of the dataset? In other words, make the "ink" pixels have intensity 1 and the non-ink pixels have intensity zero. I would be curious to know if your baseline does better on one version or the other.

Yaroslav Bulatov said...

I don't expect it to make any difference -- pixel-level features are learned by the stacked autoencoder, and there's nothing biasing the learner to prefer 0's or 1's to start with.

Anonymous said...

It makes a difference on MNIST, which is why I asked.

Anonymous said...

Can we use this data for research? For instance, if we obtain relatively good results or propose something novel, are we allowed to publish anything on it?
regards
ML_random_guy

Yaroslav Bulatov said...

ML_random_guy -- that depends on whether your country has laws against publishing

Anonymous said...

Hello, it's nice to have a new, challenging dataset like this. Do you recommend a specific evaluation protocol (number of training/test images)? Otherwise people will work on different subsets and results will not be directly comparable.

Yaroslav Bulatov said...

Train on the whole "dirty" dataset, evaluate on the whole "clean" dataset.

This is a better indicator of the real-life performance of a system than a traditional 60/30 split, because in practice there is often a ton of low-quality ground truth and only a small amount of high-quality ground truth. For this task, I can get millions, possibly billions, of distinct digital glyph images with 5-10% of labels wrong, but I'm stuck with a small amount of near-perfectly labeled glyphs.
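To make the protocol concrete, here is a minimal sketch (not a tuned baseline, and not the author's setup) using scikit-learn logistic regression on flattened pixels, assuming the hypothetical load_folder helper from the conversion sketch in the post:

from sklearn.linear_model import LogisticRegression

# Fit on the large/dirty set, report accuracy on the small/clean set.
# Fitting on all ~500k images is slow; subsample for a quick check.
train_x, train_y = load_folder('notMNIST_large')
test_x, test_y = load_folder('notMNIST_small')

clf = LogisticRegression(max_iter=200)
clf.fit(train_x.reshape(len(train_x), -1), train_y)
print('clean-set accuracy:', clf.score(test_x.reshape(len(test_x), -1), test_y))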

Anonymous said...

Thanks for the protocol info. Would it be possible to get a tar archive with PNG images of the small dataset, like the one for the huge dataset? I'm not using Matlab.

Yaroslav Bulatov said...

Oops, small tar should've been in the directory to start with, fixed

osdf said...

I used this dataset to test some of my code and got about a 3.8% error rate. Are there more results known for this dataset? A few lines of text are here.

Yaroslav Bulatov said...

Hey, that's pretty impressive! This is the highest accuracy I know of. I'm working on a larger dataset to release publicly, but I've been slowed down by some legal clearance hurdles.

Yaroslav Bulatov said...

How do you do finetuning? Hinton's contrastive wake-sleep?

goodfellow.ian said...

What is unicode370k.tar.gz?

Yaroslav Bulatov said...

It's a bunch of characters taken from the tail end of unicode values

http://yaroslavvb.blogspot.com/2011/11/shapecatcher.html

Nicholas Leonard said...

How did you split your dataset into train, valid, and test to get 89%?

Nicholas Leonard said...

Hi, Zhen Zhou and I from the LISA lab at Université de Montréal trained a couple of 4-layer MLPs with 1024, 300, and 50 hidden neurons respectively. We divided the noisy set into 5/6 train and 1/6 valid, and kept the clean set for testing. We got 97.1% accuracy on the test set at epoch 412, using early stopping, linear decay of the learning rate, a hard constraint on the norm of the weights, and tanh activation units. We get approximately 93% on valid and 98% on train. The train set is easy to overfit (you can get 100% accuracy on train if you continue training). One could probably do better by pursuing hyperparameter optimization further. We used Torch 7.
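For a concrete picture of the architecture described in this comment, here is a rough modern PyTorch sketch (the original work used Torch 7; the layer sizes and tanh units follow the comment, while the learning rate, norm limit, and decay schedule are assumptions):

import torch
import torch.nn as nn

# Sketch of the described MLP: 784-1024-300-50-10 with tanh units.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 1024), nn.Tanh(),
    nn.Linear(1024, 300), nn.Tanh(),
    nn.Linear(300, 50), nn.Tanh(),
    nn.Linear(50, 10),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # assumed learning rate
# Linear learning-rate decay, as mentioned in the comment.
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.01, total_iters=400)

def apply_max_norm(model, max_value=2.0):
    # Hard constraint on the norm of each weight row (limit value is assumed).
    with torch.no_grad():
        for m in model:
            if isinstance(m, nn.Linear):
                norms = m.weight.norm(dim=1, keepdim=True).clamp(min=max_value)
                m.weight.mul_(max_value / norms)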