Thursday, September 08, 2011

notMNIST dataset

I've taken some publicly available fonts and extracted glyphs from them to make a dataset similar to MNIST. There are 10 classes, with the letters A-J taken from different fonts. Here are some examples of the letter "A". Judging by the examples, one would expect this to be a harder task than MNIST, and that seems to be the case: logistic regression on top of a stacked auto-encoder with fine-tuning gets about 89% accuracy, whereas the same approach gets about 98% on MNIST.

The dataset consists of a small hand-cleaned part, about 19k instances, and a large uncleaned part, about 500k instances. The two parts have approximately 0.5% and 6.5% label error rates, respectively. I estimated this by looking through the glyphs and counting how often my guess of the letter didn't match its Unicode value in the font file.

The Matlab version of the dataset (a .mat file) can be accessed as follows:
load('notMNIST_small.mat')  % loads 'images' (28x28xN) and 'labels' (N-vector)
for i=1:5
    % display each glyph, with its class label as the window title
    figure('Name',num2str(labels(i))),imshow(images(:,:,i)/255)
end
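
If you work in Python rather than Matlab, the same file can be read with SciPy. A minimal sketch, assuming the .mat file holds the same images and labels variables used in the Matlab snippet above:

import scipy.io
import matplotlib.pyplot as plt

# Assumes 'images' is a 28x28xN array and 'labels' an N-vector,
# matching the Matlab snippet above.
data = scipy.io.loadmat('notMNIST_small.mat')
images, labels = data['images'], data['labels'].ravel()
for i in range(5):
    plt.figure(str(labels[i]))
    plt.imshow(images[:, :, i], cmap='gray')
plt.show()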
The zipped version is just a set of PNG images grouped by class. You can turn the zipped version of the dataset into the Matlab version as follows:
tar -xzf notMNIST_large.tar.gz
python matlab_convert.py notMNIST_large notMNIST_large.mat
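The conversion script itself isn't shown here, so as a rough illustration only, here is a hypothetical sketch of what such a script might do, assuming NumPy, Pillow, and SciPy are installed and the PNGs sit in per-class directories A through J (the real matlab_convert.py may differ):

import os, sys
import numpy as np
from PIL import Image
import scipy.io

# Hypothetical converter sketch; the actual matlab_convert.py may differ.
src, dst = sys.argv[1], sys.argv[2]
images, labels = [], []
for label, letter in enumerate('ABCDEFGHIJ'):
    folder = os.path.join(src, letter)
    for name in sorted(os.listdir(folder)):
        try:
            img = Image.open(os.path.join(folder, name)).convert('L')
        except IOError:
            continue  # skip unreadable glyph files
        images.append(np.array(img, dtype=np.uint8))
        labels.append(label)
# Stack into a 28x28xN array to match the access pattern used above.
scipy.io.savemat(dst, {'images': np.dstack(images),
                       'labels': np.array(labels, dtype=np.uint8)})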
Approaching a 0.5% error rate on notMNIST_small would be very impressive. If you run your algorithm on this dataset, please let me know your results.
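For a quick baseline to compare against, a plain logistic regression on raw pixels (a sketch assuming scikit-learn; much simpler than the stacked auto-encoder pipeline mentioned above) could look like:

import scipy.io
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Flatten each 28x28 glyph into a 784-dimensional pixel vector.
data = scipy.io.loadmat('notMNIST_small.mat')
X = data['images'].reshape(28 * 28, -1).T / 255.0
y = data['labels'].ravel()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('test accuracy:', clf.score(X_test, y_test))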
