I've taken some publicly available fonts and extracted glyphs from them to make a dataset similar to MNIST. There are 10 classes, with letters A-J taken from different fonts.
Here are some examples of letter "A"

Judging by the examples, one would expect this to be a harder task than MNIST. This seems to be the case -- logistic regression on top of a stacked auto-encoder with fine-tuning gets about 89% accuracy, whereas the same approach got 98% on MNIST.
The dataset consists of a small hand-cleaned part, about 19k instances, and a large uncleaned part, 500k instances. The two parts have approximately 0.5% and 6.5% label error rates, respectively. I got these estimates by looking through glyphs and counting how often my guess of the letter didn't match its unicode value in the font file.
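
As a rough sanity check on those rates, the sketch below converts them into expected counts of mislabeled glyphs. It uses the approximate sizes quoted above; the function name is only for illustration.

```python
def expected_label_errors(num_instances, error_rate):
    """Expected number of mislabeled instances for a given label error rate."""
    return num_instances * error_rate


# Approximate sizes and error rates quoted in the post.
clean_errors = expected_label_errors(19_000, 0.005)
noisy_errors = expected_label_errors(500_000, 0.065)

print(f"clean set: about {clean_errors:.0f} mislabeled glyphs")
print(f"noisy set: about {noisy_errors:.0f} mislabeled glyphs")
```

Under these figures, even a classifier that always predicts the true letter would be scored as wrong on roughly 0.5% of the clean set.
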
The Matlab version of the dataset (.mat file) can be accessed as follows:

load('notMNIST_small.mat')
for i=1:5
figure('Name',num2str(labels(i))),imshow(images(:,:,i)/255)
end

The zipped version is just a set of png images grouped by class. You can turn the zipped version of the dataset into the Matlab version as follows:

tar -xzf notMNIST_large.tar.gz
python matlab_convert.py notMNIST_large notMNIST_large.mat

Approaching 0.5% error rate on notMNIST_small would be very impressive. If you run your algorithm on this dataset, please let me know your results.

## 63 comments:

What does your baseline get on the negated version of the dataset? In other words, make the "ink" pixels have intensity 1 and the non-ink pixels have intensity zero. I would be curious to know if your baseline does better on one version or the other.

I don't expect it to make any difference -- pixel-level features are learned by the stacked autoencoder, and there's nothing biasing the learner to prefer 0's or 1's to start with.

It makes a difference on MNIST, which is why I asked.
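
The negation discussed in this exchange can be sketched in a few lines. This assumes 8-bit grayscale intensities in the range 0-255 (consistent with the `/255` in the Matlab snippet) and a plain list-of-rows image; `negate_image` is a hypothetical helper name.

```python
def negate_image(image, max_value=255):
    """Return a copy of a 2D grayscale image with intensities inverted."""
    return [[max_value - pixel for pixel in row] for row in image]


# Tiny stand-in for a glyph image (rows of pixel intensities).
glyph = [
    [0, 0, 255],
    [0, 128, 255],
]
negated = negate_image(glyph)
```

Applying the function twice gives back the original image, so both versions of the dataset can be produced from either one.
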

Can we use this data for research? For instance, if we obtain relatively good results or propose something novel, are we allowed to publish anything on it?

regards

ML_random_guy

ML_random_guy -- that depends on whether your country has laws against publishing

Hello, it's nice to have such a new challenging dataset. Do you recommend a specific evaluation protocol (number of training/test images) ? Otherwise people will work on different subsets and results will not be directly comparable.

Train on the whole "dirty" dataset, evaluate on the whole "clean" dataset.

This is a better indicator of real-life performance of a system than a traditional 60/30 split because there is often a ton of low-quality ground truth and only a small amount of high-quality ground truth. For this task, I can get millions, possibly billions, of distinct digital glyph images with 5-10% of labels wrong, but I'm stuck with a small amount of near-perfectly labeled glyphs.

Thanks for the protocol info. Would it be possible to get a tar archive with PNG images of the small dataset like the huge one ? I'm not using Matlab.

Oops, small tar should've been in the directory to start with, fixed

I used this dataset to test some of my code and got about 3.8% error rate. Are there more results known for this dataset? A few lines of text are here.

Hey, that's pretty impressive! This is the highest accuracy I know of. I'm working on a larger dataset to release publicly, but I'm slowed down by some legal clearance hurdles.

How do you do finetuning? Hinton's contrastive wake-sleep?

What is unicode370k.tar.gz?

It's a bunch of characters taken from the tail end of unicode values

http://yaroslavvb.blogspot.com/2011/11/shapecatcher.html

How did you split your dataset into train,valid,test to get 89%?

Hi, Zhen Zhou and I from the LISA lab at Université de Montréal trained a couple of 4-layer MLPs with 1024-300-50 hidden neurons respectively. We divided the noisy set into 5/6 train and 1/6 valid and kept the clean set for testing. We got 97.1% accuracy on the test set at epoch 412 with early stopping, linear decay of the learning rate, a hard constraint on the norm of the weights and tanh activation units. We get approximately 93% on valid and 98% on train. The train set is easy to overfit (you can get 100% accuracy on train if you continue training). One could probably do better by pursuing hyper-parameter optimization further. We used Torch 7.
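
A minimal sketch of the 5/6 train, 1/6 valid split described in this comment, assuming the noisy set is addressed by integer indices. The helper name, the seed and the toy size of 600 are illustrative and not taken from the comment.

```python
import random


def split_train_valid(items, valid_parts=6, seed=0):
    """Shuffle items and hold out one part in `valid_parts` for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    num_valid = len(shuffled) // valid_parts
    return shuffled[num_valid:], shuffled[:num_valid]


noisy_indices = range(600)  # stand-in for indices into the noisy set
train, valid = split_train_valid(noisy_indices)
```

The clean set is left untouched by this step and is used only for the final test score.
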

With a simple neural network (784, 1024, 10), where the activation function was ReLU followed by just a normal softmax, I got 89.3% accuracy on the test set. This was without activation decay, pre stop, dropout & co, with 3001 iterations and a batch size of 128.

Step: 3000

Minibatch accuracy: 86.7%

Validation accuracy: 82.6%

Finish (after Step 3001):

Test accuracy: 89.3%

Minibatch loss at step 3000: 55.872269

Minibatch accuracy: 79.7%

Validation accuracy: 84.4%

Test accuracy: 90.6%

With a neural network with a single hidden layer (1024 nodes), Relu and l2 regularization.

Minibatch loss at step 10000: 123.963661

Minibatch accuracy: 45.3%

Validation accuracy: 85.3%

Test accuracy: 91.4%

With a dropout and relu and l2 regularizer, single hidden layer 1024 node.

Yaroslav Bulatov,

Thank you for the fun and challenging dataset.

How were the names of the files chosen?

I'm working on renaming each one to the phash value of the image. It looks like the names might already be the result of a hash.

Check out the Udacity course in deep learning, made by Google. They use this dataset extensively and show some really powerful techniques. The goal of the last assignment was to experiment with these techniques to find the best accuracy using a regular multi-layer perceptron. I have a pretty beefy machine: 6600K OC, 2x GTX 970 OC, 16gb DDR4, Samsung 950 Pro; so I set up a decent sized network and let it train for a while.

My best network gets:

Test accuracy: 97.4%

Validation accuracy: 91.9%

Minibatch accuracy: 97.9%

First I applied a Phash to every image and removed any with direct collisions. Then I split the large folder into ~320k training and ~80k validation. I used ~17k in the small folder for testing. Trained on mini-batches using SGD on the cross-entropy, dropout between each layer and an exponentially decaying learning rate. The network has three hidden layers with RELU units, plus a standard softmax output layer.
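
The de-duplication step can be sketched as follows. Note the substitution: this uses an exact SHA-1 digest of the file bytes from the standard library, not a perceptual hash, so it only catches byte-identical files. The function name and sample data are hypothetical.

```python
import hashlib


def remove_exact_duplicates(images):
    """Keep the first occurrence of each distinct byte string.

    `images` is an iterable of (name, raw_bytes) pairs. Uses an exact
    SHA-1 digest, not a perceptual hash.
    """
    seen = set()
    unique = []
    for name, data in images:
        digest = hashlib.sha1(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((name, data))
    return unique


samples = [("a.png", b"\x00\x01"), ("b.png", b"\x00\x01"), ("c.png", b"\x02")]
unique_samples = remove_exact_duplicates(samples)
```

A perceptual hash would additionally collapse near-identical renderings, which an exact digest treats as distinct.
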

Here are the parameters:

Mini-batch size: 1024

Hidden layer 1 size: 4096

Hidden layer 2 size: 2048

Hidden layer 3 size: 1024

Initial learning rate: 0.1

Dropout probability: 0.5

I ran this for 150k iterations, which took an hour and a half using one GPU. Learning pretty much stopped at 60k, but the model never began to overfit. I believe that is because the dataset is so large and because of the dropout. Even at the end of all that training with a good size network the mini-batch accuracy still did not reach 100%, so learning could continue, albeit slowly.

The next assignment is to use a convolutional network, which looks promising. I'll try to post those results too.

Could you make your code available? Or at least say which parameters you used for the exponentially decaying learning rate? Did you use L2 regularization (if yes, with which regularization factor)? I tried to use the same network as you did and it simply doesn't converge.

Test accuracy: 98.09%

With a CNN layout as follows:

3 x convolutional (3x3)

max pooling (2x2)

dropout (0.25)

3 x convolutional (3x3)

max pooling (2x2)

dropout (0.25)

dense (4*N)

dropout (0.5)

dense (2*N)

dropout (0.5)

dense (N)

dropout (0.5)

softmax (10)

N is the number of pixels in the images. All layers use relu activation. I also used some zero padding before each convolutional layer. The network was trained with Adadelta. It took ~45 iterations with early stopping at patience 10. As a final step I ran SGD with the same early stopping and a decaying learning rate starting at 0.1. It ran about 15 iterations. Evaluating the network on the training set, the accuracy was 99.07%; on the validation set it was 94.25%.

Minibatch loss at step 4999: 0.901939

Minibatch accuracy: 75.0%

Validation accuracy: 87.3%

Test accuracy: 93.3% @step=4999

Model saved in file: save/myconvnet_5000

I used an architecture similar to LeNet, and it seems to get better as the number of steps gets larger.

Where can I download notMNIST? The link above goes to an account that has been suspended.

Not sure if this is the complete dataset, but the Udacity course on Deep Learning using notMNIST provides the following links:

http://commondatastorage.googleapis.com/books1000/notMNIST_large.tar.gz

http://commondatastorage.googleapis.com/books1000/notMNIST_small.tar.gz

Test accuracy: 96.98%

With a CNN layout with following configurations, which is similar to [LeNet5](http://culurciello.github.io/tech/2016/06/04/nets.html)

However there is little difference

convolutional (3x3x8)

max pooling (2x2)

dropout (0.7)

relu

convolutional (3x3x16)

max pooling (2x2)

dropout (0.7)

relu

convolutional (3x3x32)

avg pooling (2x2): according to above article

dropout (0.7)

relu

fully-connected layer (265 features)

relu

dropout (0.7)

fully-connected layer (128 features)

relu

dropout (0.7)

softmax (10)

decaying learning rate starting at 0.1

batch_size: 128

Training accuracy: 93.4%

Validation accuracy: 92.8%

Accuracy: 96.1 without convolution (assignment 3 in TensorFlow course)

Using Xavier initialization significantly boosted my results. Network specifications:

1. Batch size = 2048

2. Hidden units: 4096, 2048, 1024

3. Adam optimizer with 0.0001 learning rate

4. Dropout on each hidden layer

5. Xavier initialization
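
For readers unfamiliar with it, here is a minimal pure-Python sketch of Xavier (Glorot) uniform initialization, which draws weights from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)). The layer sizes below are illustrative, and whether this comment used the uniform or the normal variant is not stated.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng=random):
    """Sample a fan_in x fan_out weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


# 784 inputs (28x28 pixels) feeding a small illustrative layer of 16 units.
weights = xavier_uniform(784, 16, rng=random.Random(0))
```

The scale shrinks as layers get wider, which keeps early activations from saturating or vanishing in deep stacks.
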

Hi

I'm trying to use TensorFlow to do character recognition. I am able to use your dataset (A-J) and get some data from the char74k dataset (K to Z) to train on character data and predict, but the char74k set is pretty limited and is not enough to get good accuracy. Have you posted anything similar for characters K to Z?

no convolution, 1 hidden layer 94.4 % with test set

batch size 128

L2 regularization beta 0 (no L2 regularization)

initialize w with deviation 0.03

initialize bias with all 0

Learning rate 0.5 (fix, not decay)

single hidden layer unit # 1024

dropout_keepratio 1 (no dropout)

I'm following udacity tutorial.

It's strange that whenever I add L2 regularization, dropout, or learning rate decay, the test accuracy falls. I can't figure out why.

The test accuracy will fall if you choose a wrong value of regularization parameters. A beta of .005 gives good results.
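
To make the role of beta concrete, here is a small sketch of an L2-regularized loss. It assumes the penalty is half the sum of squared weights (the convention used by `tf.nn.l2_loss`); the toy weights and the data loss of 0.3 are made up for illustration.

```python
def l2_penalty(weights):
    """Half the sum of squared weights, over a list of weight matrices."""
    return 0.5 * sum(w * w for matrix in weights for row in matrix for w in row)


def regularized_loss(data_loss, weights, beta=0.005):
    """Data loss plus beta times the L2 penalty."""
    return data_loss + beta * l2_penalty(weights)


toy_weights = [[[1.0, -2.0], [0.5, 0.0]]]  # one 2x2 weight matrix
loss = regularized_loss(data_loss=0.3, weights=toy_weights)
```

With beta too large the penalty term dominates the data loss, which is one way regularization can end up lowering test accuracy.
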

My accuracy falls slightly after using dropout. Is there a possibility of wrong implementation of tf.nn.dropout() or is it a possible scenario?

Multi Layer Neural Net without convolution - Test Accuracy = 94.4%

Architecture

3 Layer Neural Network(No convolution) = input-784, hidden-526, output=10

L2- Regularization with lambda(regularization parameter) = .001

Number of steps = 3000

Batch size = 500

2 hidden Layers (Total 4 layers) - without convolution - Test Accuracy = 95.8 %

Architecture

4 Layer Neural Network (No convolution) = input-784, hidden1-960, hidden2-650, output-10

L2- Regularization with lambda(regularization parameter) = .0005

Number of steps = 75000

Batch size = 1000

Minibatch accuracy: 93.2%

Validation accuracy: 91.2%

Test accuracy: 96.3%

After 10000 steps.

Architecture:

Two hidden layers:

num_hidden_nodes = 1024

num_hidden_nodes_2 = 100

Both with Relu inputs. Cross entropy + L2 regularization (beta = 1.3e-4).

SGD, batch size 400.

Most importantly, weights were initialized with truncated normal distro. with sigma = 0.01.

Exponential decay starting at 0.5, 0.65 decay_rate every 1000 steps.
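
The schedule in the last line can be sketched with the usual exponential decay formula, rate = initial_rate * decay_rate ** (step / decay_steps). Whether the comment used the smooth or the staircase variant is not stated, so both are shown; the function name is illustrative.

```python
def exponential_decay(initial_rate, step, decay_steps, decay_rate, staircase=False):
    """Learning rate after `step` steps of exponential decay."""
    # Staircase mode only drops the rate at whole multiples of decay_steps.
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_rate * decay_rate ** exponent


# Values from the comment: start at 0.5, multiply by 0.65 every 1000 steps.
for step in (0, 1000, 5000, 10000):
    print(step, exponential_decay(0.5, step, 1000, 0.65))
```
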

Using Keras on an average gaming laptop with a moderate GPU, training took less than 2 minutes on the full (Udacity) training set of 200,000 samples, using 10,000 validation samples and measuring accuracy on a separate test set of 10,000 samples.

With a simple multilayer network, I reached 96.66%

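With Keras, the code for the network itself is really simple:

# Imports the snippet relies on (Keras 1.x-era API, as in the original comment).
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

batch_size = 128
nb_classes = 10
nb_epoch = 20

# Three ReLU hidden layers with light dropout, then a 10-way softmax.
model = Sequential()
model.add(Dense(1024, input_shape=(784,)))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

model.summary()

model.compile(loss='categorical_crossentropy',
              #optimizer=RMSprop(),
              optimizer='adagrad',
              #optimizer='adadelta',
              metrics=['accuracy'])

# The *_dataset arrays are assumed to be flattened to shape (N, 784) and the
# labels one-hot encoded, as categorical_crossentropy requires.
# Newer Keras versions name the nb_epoch argument `epochs`.
history = model.fit(train_dataset, train_labels,
                    batch_size=batch_size, nb_epoch=nb_epoch,
                    verbose=1, validation_data=(valid_dataset, valid_labels))

score = model.evaluate(test_dataset, test_labels, verbose=0)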

If you want to generate your own dataset like notMNIST, you should try not_notMNIST

My final result is 96.23% accuracy. Network architecture (built with Keras):

conv(3x3x32)

maxp(2x2)

dropout(0.05)

conv(3x3x16)

maxp(2x2)

dropout(0.05)

dense(128, relu)

dense(64, relu)

dense(10, softmax)

I used SGD with default params. I also got 92.03% on the valid dataset and 92.24% on the train dataset. It seems to be a general tendency that the test score is higher.

97.2% on a fully connected net.

At last iteration, 100k:

Minibatch accuracy: 99.0%

Validation accuracy: 92.2%

Test accuracy: 97.2%

Architecture:

3 hidden layers, 4096 - 3072 - 1024, with relu and 0.5 dropout

Xavier weight init

Batch size 200

Data sets original (200k train, 10k valid, 10k test), no further preprocessing

Loss: softmax_cross_entropy_with_logits + L2 regularization on weights with weight of 1e-4

Learning rate 0.3 with decay of 0.96 every 1000 iterations

Total 100k iterations

[edit - I forgot the dropout on first post]

Test accuracy: 96.12% with only 5000 iterations on a convolutional network with two conv layers and a final fully connected layer.

Minibatch of 50 images was used.

On a very simple 1 hidden layer network without regularization I also get:

Minibatch accuracy: 89.8%

Validation accuracy: 82.9%

Test accuracy: 89.8%

I've seen many other users reporting Test accuracy which is significantly higher than validation accuracy.

Validation and Test are the same size in my case. Is a higher test score reasonable or is it just chance? Should I consider the worst between test and validation as the expected performance of my network?

Gianni
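
One possible explanation, offered as an assumption rather than something confirmed in this thread: in the splits described in other comments, the validation set is carved out of the noisy set (about 6.5% label errors) while the test set is the hand-cleaned one (about 0.5%). The sketch below shows how the same underlying model would score against each, under the simplifying assumption that a wrong label never happens to match a wrong prediction. The 97% true accuracy is a hypothetical value.

```python
def measured_accuracy(true_accuracy, label_error_rate):
    """Accuracy measured against noisy labels.

    Assumes prediction errors are independent of label errors and that a
    wrong label never coincides with a wrong prediction.
    """
    return true_accuracy * (1.0 - label_error_rate)


# Hypothetical model that is right 97% of the time on true labels.
valid_score = measured_accuracy(0.97, 0.065)  # noisy validation labels
test_score = measured_accuracy(0.97, 0.005)   # hand-cleaned test labels

print(f"validation: {valid_score:.3f}")
print(f"test:       {test_score:.3f}")
```

Under these assumptions the gap of several points between validation and test would come from label noise rather than chance.
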

I use 2 hidden layers and GradientDescentOptimizer, but the loss is nan. Why?

Try to reduce your learning rate
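
A toy illustration of why the step size matters, using plain gradient descent on f(x) = x**2 rather than the commenter's network (whose code is not shown). Each update multiplies x by (1 - 2 * learning_rate), so a rate above 1 makes the iterates grow instead of shrink.

```python
def run_gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimize f(x) = x**2 with plain gradient descent; return the final loss."""
    x = start
    for _ in range(steps):
        gradient = 2.0 * x
        x -= learning_rate * gradient
    return x * x


small_step_loss = run_gradient_descent(0.1)  # converges towards 0
large_step_loss = run_gradient_descent(1.5)  # diverges

print(small_step_loss, large_step_loss)
```

In a real network the same kind of growth can overflow to infinity, after which operations such as inf - inf produce NaN.
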

96.2% on a fully connected net.

steps, 200000:

batch 200 accuracy: 94.0%

test accuracy: 96.2%

https://github.com/ms03001620/NotMnist

test acc 98.3%

mini-batch train acc 95.7%

val acc 94.2%

Techniques: "shallow" resnet (used val set to select arch), dropout, horizontal + vertical shift data augmentation, reduce lr on plateaus.

implementation
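
The shift augmentation mentioned above can be sketched in pure Python as follows. This is not the linked implementation: the helper name, the zero fill value and the list-of-rows image format are assumptions made for illustration.

```python
def shift_image(image, dx, dy, fill=0):
    """Shift a 2D image by dx columns (right) and dy rows (down), padding with fill."""
    height = len(image)
    width = len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            new_y, new_x = y + dy, x + dx
            # Pixels pushed outside the frame are dropped.
            if 0 <= new_y < height and 0 <= new_x < width:
                shifted[new_y][new_x] = image[y][x]
    return shifted


glyph = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
shifted_right = shift_image(glyph, dx=1, dy=0)
shifted_up = shift_image(glyph, dx=0, dy=-1)
```

Training on randomly shifted copies with unchanged labels encourages the network to tolerate small translations of a glyph.
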

Test Accuracy 95.5%

with batch size = 128, number of iterations = 10k

Three 5x5 convolution layers of depth 16, 32, 64 respectively

Three hidden layers with number of hidden nodes 256, 128 and 64 respectively

Dropout 0.7

Learning decay starting with 0.2 learning rate

https://sandipanweb.wordpress.com/2017/08/03/deep-learning-with-tensorflow-in-python-2/

Test accuracy: 97.2%

Implementation:

2 CNNs with max pooling followed by a 1 layer fully-connected NN:

Patch size = 7x7

Stride for CNN = 1

Pooling size = 2x2

Stride for Pooling = 2

Depth = 50, 100

Final layer nodes = 512

Dropout_keep_prob = 0.7

Managed to achieve 97.3% test score!

Used 5 hidden layers, batch_size = 256, adam optimisation with initial learning rate of 1e-4, 200,000 steps. Would be happy to share details, code etc. if anybody is interested.

Test accuracy: 98.0%

Implementation:

2 CNNs with max pooling followed by a 1 layer fully-connected NN:

Patch size = 5x5

Stride for CNN = 1

Pooling size = 2x2

Stride for Pooling = 2

Depth = 50, 100

Hidden layer Nodes in FCNN = 512

Dropout_keep_prob = init: 95%, decaying to 70%

The Udacity Deep Learning course challenged me to get as high an accuracy as I could using only dense layers, without any convolutions.

09-16 04:03:04.724 assignment_03_regularization.py:470 INFO Train loss: 0.167651, train accuracy: 97.99%

09-16 04:03:04.725 assignment_03_regularization.py:473 INFO Test loss: 0.256166, TEST ACCURACY: 96.51% BEST ACCURACY 96.64% <<<<<<<

Managed to achieve only as high as 96.6% with the following model:

- 5 fully connected layers 2048-1024-1024-1024-512

- 0.5 dropout

- batch normalization

- weight decay with 0.00001 scale

- batch 128 images

- Adam optimizer with starting LR=1e-4

- Xavier weight initialization (this is critical!)

This is on a "sanitized" test dataset, where I removed all images that were identical or close to some images in the training data. Without this sanitizing, it would've probably been a bit better.

Trained that for 2 hours on GTX1060, it continued to climb higher, but slowly.

Code is here: https://github.com/alex-petrenko/udacity-deep-learning/blob/master/assignment_03_regularization.py

(function is called train_deeper_better())

Simple convnet achieved 98.19% on test, code here: https://github.com/alex-petrenko/udacity-deep-learning/blob/14714ee4151b798cde0a31a94ac65e08b87d0f65/assignment_04_convolutions.py#L39

(5,5)->(5,5)->pool->(3,3)->(3,3)->pool->fc1024->fc1024->logits

INFO Starting new epoch #121!

INFO Minibatch loss: 0.150696, reg loss: 0.041653, accuracy: 96.88%

INFO Train loss: 0.068505, train accuracy: 99.28%

INFO Test loss: 0.118685, TEST ACCURACY: 98.05% BEST ACCURACY 98.19% <<<<<<<

Test accuracy: 96.7%

Implementation:

* 2 hidden layers of 1024 & 256 nodes

* weight initialization using gaussian random distribution with stddev = 2/sqrt(size of layer input)

* minibatch size of 128

* 30 epochs - each epoch is a full pass over the train data set, randomly shuffled into minibatches (200000 / 128 = 1562.5, so 1563 steps per epoch)

* learning rate 0.1

* dropout with keep_prob = 0.9

* no L2 regularization

After epoch 30:

train accuracy = 98.0%

dev accuracy = 92.0%

test accuracy = 96.7%

My best straightforward CNN:

C5x4-C19x8-C5x16-P2-C7x64-P2-C3x256-P3S-C1x1024-C1x512-F2048-F64-F10

where

C5x4 = convolution with 5x5 kernel and 4 maps output

P3S = pooling with 3x3 size of type SAME

ReLU

initial weight SD: 0.05 for conv layers, Xavier for full layers

max pooling

full layer dropout 0.6

conv layer dropout 0.1

conv layer dropout before pooling

shuffle train dataset after each epoch

momentum optimizer with learning rate 0.05

batch size 2048

after 470 epochs (early stop):

train accuracy 99.6%

validation accuracy 94.6%

test accuracy 98.2%

This configuration emerged from a hyperparameter search; it was the 103rd configuration tried. Further runs of another 34 configurations did not improve on it.

During training, I have seen as much as 98.4% on the test set, but with a correspondingly lower validation accuracy. In general, several runs of the same configuration could end up with a 0.3% difference in validation accuracy. So ultimately one could run the winning configuration several times until a lucky combination of initial weights is reached.
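
The compact architecture notation used in this comment can be unpacked mechanically. The parser below is a sketch: the comment defines only the C and P tokens, so reading F as a fully connected ("full") layer is an assumption, and the dictionary layout is purely illustrative.

```python
import re

# C<kernel>x<maps> = convolution, P<size>[S] = pooling (S means SAME),
# F<units> = full layer (assumed meaning).
LAYER_PATTERN = re.compile(r"^(?:C(\d+)x(\d+)|P(\d+)(S?)|F(\d+))$")


def parse_architecture(spec):
    """Turn a dash-separated layer string into a list of layer descriptions."""
    layers = []
    for token in spec.split("-"):
        match = LAYER_PATTERN.match(token)
        if match is None:
            raise ValueError(f"unrecognized layer token: {token!r}")
        kernel, maps, pool, same, units = match.groups()
        if kernel is not None:
            layers.append({"type": "conv", "kernel": int(kernel), "maps": int(maps)})
        elif pool is not None:
            layers.append({"type": "pool", "size": int(pool), "same": same == "S"})
        else:
            layers.append({"type": "full", "units": int(units)})
    return layers


spec = "C5x4-C19x8-C5x16-P2-C7x64-P2-C3x256-P3S-C1x1024-C1x512-F2048-F64-F10"
layers = parse_architecture(spec)
```

Here `layers` holds one entry per token, in order, which makes it easy to feed such a string into a model-building loop during a configuration search.
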
