Wednesday, August 05, 2009

New Robust OCR dataset

I've collected this dataset for a project that involves automatically reading bibs in pictures of marathons and other races. With about 20k digits, this dataset is larger than the robust-reading dataset of the ICDAR 2003 competition, and it is more uniform because it's digits-only. I believe it is more challenging than the MNIST digit recognition dataset.

I'm now making it publicly available in hopes of stimulating progress on the task of robust OCR. Use it freely, with the only requirement being that if you are able to exceed 80% accuracy, you have to let me know ;)

The dataset file contains the raw data (images), as well as a Weka-format ARFF file for a simple set of features.

For completeness I include the MATLAB script used for initial pre-processing and feature extraction, and a Python script to convert space-separated output into ARFF format. Check "readme.txt" for more details.
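To give a rough idea of what the space-separated-to-ARFF conversion involves, here is a minimal sketch. It is not the script shipped with the dataset: the function name `ssv_to_arff`, the attribute names `f0, f1, ...`, the relation name, and the assumption that the class label is the last column of each line are all illustrative guesses.

```python
def ssv_to_arff(ssv_lines, relation="bib_digits"):
    """Convert space-separated rows into Weka ARFF text.

    Assumption (not confirmed by the dataset's readme): every row is
    numeric features followed by the class label in the last column.
    """
    rows = [line.split() for line in ssv_lines if line.strip()]
    n_attrs = len(rows[0]) - 1
    # Nominal class attribute: list every label that occurs in the data.
    labels = sorted({row[-1] for row in rows})

    out = [f"@relation {relation}", ""]
    out += [f"@attribute f{i} numeric" for i in range(n_attrs)]
    out.append("@attribute class {" + ",".join(labels) + "}")
    out += ["", "@data"]
    # ARFF data rows are comma-separated.
    out += [",".join(row) for row in rows]
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(ssv_to_arff(["0.5 1.0 3", "0.25 0.0 7"]))
```

For a real file, something like `ssv_to_arff(open("digits.ssv"))` would work under the same column-order assumption.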



Anonymous said...


are these the correct numbers?

bib_digits_dataset/src$ wc -l digits.ssv
19804 digits.ssv

$ head -1 digits.ssv | wc -w

please define the size of training and test data as well as how to split the data exactly.

furthermore, do you mean "1 - error rate" by accuracy? so, the trace of the confusion matrix, isn't it?


Yaroslav said...

There's a total of 19804 instances. Each instance has 200 attributes and 1 class label in the included baseline feature extraction scheme, so the numbers are correct.

When I ran my experiments I used the first 15000 instances for training and the rest for testing, getting 3830 correct (79.8% accuracy). The algorithm was Weka's SMO with a Gaussian kernel, using gradient, pixel-level and Zernike moment features. By accuracy I mean the percentage of instances that are classified correctly.

At the moment I'm uploading an updated tar.gz with labels.txt and labels.test.txt to facilitate this split. The SSV and ARFF files have not been updated, so it's up to the user to regenerate them, or split them into 15000/4804 manually.
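As a small illustration of that definition, accuracy is the sum of the diagonal of the confusion matrix (the correctly classified instances) divided by the total number of instances. The function name and the toy 2x2 matrix below are made up for the example and are not taken from the experiments.

```python
def accuracy_from_confusion(matrix):
    """Fraction of correctly classified instances.

    matrix[i][j] = number of instances of true class i predicted as j,
    so the diagonal holds the correct predictions.
    """
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


if __name__ == "__main__":
    toy = [[5, 1],
           [2, 2]]  # hypothetical counts: 7 of 10 instances correct
    print(accuracy_from_confusion(toy))
```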

Anonymous said...

thank you for the quick response!

so basically it would suffice to do a

head -n 15000 digits.ssv > digits.train.ssv

tail -n 3830 digits.ssv > digits.test.ssv

did I get you right?


Yaroslav said...

almost, except use tail -4804
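For anyone who prefers not to count lines by hand, the same split can be sketched in Python: the first 15000 lines go to the training file and whatever remains (4804 lines for the 19804-line file) goes to the test file. The function name `split_ssv` is made up, and the output file names are just the ones suggested in the comment above.

```python
from itertools import islice


def split_ssv(src_path, train_path, test_path, n_train=15000):
    """Write the first n_train lines to train_path and the rest to
    test_path. Returns (train_count, test_count)."""
    n_tr = n_te = 0
    with open(src_path) as src, \
         open(train_path, "w") as train, \
         open(test_path, "w") as test:
        # islice consumes exactly the first n_train lines (or fewer).
        for line in islice(src, n_train):
            train.write(line)
            n_tr += 1
        # The file iterator continues where islice stopped.
        for line in src:
            test.write(line)
            n_te += 1
    return n_tr, n_te


# Example call (assumes digits.ssv is in the current directory):
# split_ssv("digits.ssv", "digits.train.ssv", "digits.test.ssv")
```

Because the test part is "everything after line 15000", there is no second count to get wrong.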

Anonymous said...

yes, you're completely right, my fault :)

Peter said...

Thanks for pulling this together, great stuff. What license are you releasing this dataset under? Could you release it under the Creative Commons Attribution or the GNU Free Documentation License?

Yaroslav said...

Peter -- thanks! I haven't thought about licensing issues, but a quick google search shows that "public domain" is pretty permissive, so for the lawyers out there -- this dataset is part of the public domain

Miran said...

You get 85% accuracy by using a slightly modified gradient feature extraction algorithm. Gradient vectors are decomposed into a sum of components in two closest target directions. Sum of gradient strengths for 8 directions in 5x5 zones are used as features. I have matlab code if you are interested.
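For readers who want a feel for the idea, here is a rough pure-Python sketch of this kind of feature. It is not Miran's MATLAB code. The gradient operator (central differences with clamped borders), the reading of "5x5 zones" as a 5x5 grid over the image, and the function name are all assumptions. Each gradient vector with magnitude m and angle theta is split between the two neighbouring directions (multiples of 45 degrees) so that the two components add back up to the original vector.

```python
import math


def gradient_direction_features(img, zones=5, n_dirs=8):
    """img: 2D list of grey levels. Returns zones*zones*n_dirs sums
    (200 values for the defaults)."""
    h, w = len(img), len(img[0])
    step = 2 * math.pi / n_dirs          # 45 degrees for 8 directions
    feats = [0.0] * (zones * zones * n_dirs)
    for y in range(h):
        for x in range(w):
            # Assumed gradient operator: central differences, borders clamped.
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            theta = math.atan2(gy, gx) % (2 * math.pi)
            k = int(theta // step)       # lower neighbouring direction
            phi = theta - k * step       # angle past that direction
            # Parallelogram decomposition onto directions k and k+1.
            a = mag * math.sin(step - phi) / math.sin(step)
            b = mag * math.sin(phi) / math.sin(step)
            zone = (y * zones // h) * zones + (x * zones // w)
            feats[zone * n_dirs + k % n_dirs] += a
            feats[zone * n_dirs + (k + 1) % n_dirs] += b
    return feats


if __name__ == "__main__":
    ramp = [[float(x) for x in range(10)] for _ in range(10)]
    print(len(gradient_direction_features(ramp)))
```

On a purely horizontal ramp like the one in the usage lines, all of the gradient strength ends up in direction 0 of each zone.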

Yaroslav said...

Sure, that'd be pretty interesting to

What classifier did you use to get 85% accuracy?

Miran said...

Support vector machines, radial basis function kernel, I used libsvm library. I'll mail you the code now.

Anonymous said...

I am also interested in multi-class SVM with an RBF kernel for OCR.
Can I get the code?

Innovative mind said...


I am planning to do my final project (machine learning) at our college on OCR. Can you please give me the code, or can you guide me on how to do it?

Anonymous said...

what are the 200 features? how were they generated? Also is there a way to identify the arff line with the image? thanks

Unknown said...

can i get the code please