Wednesday, August 05, 2009
New Robust OCR dataset
I've collected this dataset for a project that involves automatically reading bibs in pictures of marathons and other races. With about 20k digits, it is larger than the robust-reading dataset of the ICDAR 2003 competition, and more uniform because it is digits-only. I believe it is more challenging than the MNIST digit recognition dataset.
I'm now making it publicly available in hopes of stimulating progress on the task of robust OCR. Use it freely, with the only requirement that if you are able to exceed 80% accuracy, you have to let me know ;)
The dataset file contains the raw data (images), as well as a Weka-format ARFF file for a simple set of features.
For completeness I include the Matlab script used for initial pre-processing and feature extraction, and a Python script that converts the space-separated output into ARFF format. Check "readme.txt" for more details.
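For readers who want to regenerate the ARFF file themselves, here is a minimal sketch of a space-separated-to-ARFF conversion. It is not the script shipped in the archive: the function name, the relation name, the attribute names (f1, f2, ...) and the assumption that the class label is the last field on each line are all hypothetical, so check "readme.txt" for the real layout.

```python
def ssv_to_arff(lines, relation="bib_digits", label_index=-1):
    """Convert space-separated rows into ARFF text.

    Assumption (not confirmed by the post): one field per row is the class
    label, by default the last one; every other field is a numeric feature.
    """
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        raise ValueError("no data rows")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("rows have differing numbers of fields")

    label_pos = label_index % width
    # Nominal class values are whatever labels actually occur in the data.
    labels = sorted({row[label_pos] for row in rows})

    out = ["@relation " + relation, ""]
    for i in range(width - 1):
        out.append("@attribute f%d numeric" % (i + 1))
    out.append("@attribute class {%s}" % ",".join(labels))
    out.append("")
    out.append("@data")
    for row in rows:
        features = row[:label_pos] + row[label_pos + 1:]
        # ARFF data rows are comma-separated, with the class written last here.
        out.append(",".join(features + [row[label_pos]]))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Two toy rows with two features each, standing in for digits.ssv.
    print(ssv_to_arff(["0.5 1.0 3", "0.25 0.0 7"]))
```

To convert the real file, pass an open file handle for digits.ssv as `lines` and write the returned string to disk.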
Dataset
hi!
are those the correct numbers?
bib_digits_dataset/src$ wc -l digits.ssv
19804 digits.ssv
$ head -1 digits.ssv | wc -w
201
please define the size of training and test data as well as how to split the data exactly.
furthermore, do you mean "1 - error rate" by accuracy? so, the trace of the confusion matrix, isn't it?
thanks!
p.
There's a total of 19804 instances. Each instance has 200 attributes and 1 class label in the included baseline feature extraction scheme, so the numbers are correct.
When I ran my experiments I used the first 15000 instances for training and the rest for testing, getting 3830 correct (79.8% accuracy). The algorithm was Weka's SMO with a Gaussian kernel, using gradient, pixel-level and Zernike moment features. By accuracy I mean the percentage of instances that are classified correctly.
I'm now uploading an updated tar.gz with labels.txt and labels.test.txt to facilitate this split. The SSV and ARFF files have not been updated, so it's up to the user to regenerate them, or to split them into 15000/4804 manually.
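As a rough illustration of the split and the accuracy measure described here, the sketch below cuts a list of rows after the first 15000 and computes accuracy as the trace of a confusion matrix divided by the total count. The function names and the toy confusion matrix are made up for the example; only the 19804 and 15000 figures come from this thread.

```python
def split_train_test(rows, n_train=15000):
    """First n_train rows for training, everything after that for testing."""
    return rows[:n_train], rows[n_train:]


def accuracy_from_confusion(confusion):
    """Fraction classified correctly: diagonal sum (trace) over the grand total."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


if __name__ == "__main__":
    rows = list(range(19804))  # stand-in for the 19804 lines of digits.ssv
    train, test = split_train_test(rows)
    print(len(train), len(test))  # 15000 4804

    # Hypothetical 2-class confusion matrix, rows = true class, cols = predicted.
    toy = [[8, 2],
           [1, 9]]
    print(accuracy_from_confusion(toy))
```

Accuracy defined this way is the same quantity as "1 - error rate".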
thank you for the quick response!
so basically it would suffice to do a
head -n 15000 digits.ssv > digits.train.ssv
tail -n 3830 digits.ssv > digits.test.ssv
did I get you right?
p.
almost, except use tail -n 4804
yes, you're completely right, my fault :)
Thanks for pulling this together, great stuff. What license are you releasing this dataset under? Could you release it under the Creative Commons Attribution or the GNU Free Documentation License?
Peter -- thanks! I haven't thought about licensing issues, but a quick Google search shows that "public domain" is pretty permissive, so for the lawyers out there -- this dataset is in the public domain
You get 85% accuracy by using a slightly modified gradient feature extraction algorithm. Each gradient vector is decomposed into a sum of components along the two closest target directions. The sums of gradient strengths for 8 directions in 5x5 zones are used as features. I have Matlab code if you are interested.
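The description above can be sketched roughly as follows. This is not the commenter's Matlab code: the central-difference gradient, the zone assignment by integer division, and both function names are assumptions chosen for illustration. Each gradient vector is split between the two nearest of 8 directions (parallelogram decomposition), and the components are summed per zone, which gives 8 x 25 = 200 values per image.

```python
import math


def split_gradient(gx, gy, directions=8):
    """Split one gradient vector between its two nearest target directions.

    Returns (k, a, k_next, b) such that a * d_k + b * d_k_next == (gx, gy),
    where d_i is the unit vector at angle i * 2*pi / directions.
    """
    step = 2 * math.pi / directions
    magnitude = math.hypot(gx, gy)
    angle = math.atan2(gy, gx) % (2 * math.pi)
    lower = int(angle // step)
    # Angle past the lower direction, clamped against floating-point drift.
    phi = min(max(angle - lower * step, 0.0), step)
    a = magnitude * math.sin(step - phi) / math.sin(step)
    b = magnitude * math.sin(phi) / math.sin(step)
    return lower % directions, a, (lower + 1) % directions, b


def zone_direction_features(image, zones=5, directions=8):
    """Sum directional gradient strengths over a zones x zones grid.

    `image` is a list of rows of grey values. Gradients use central
    differences on interior pixels (an assumption; the comment does not
    say which gradient operator was used).
    """
    height, width = len(image), len(image[0])
    features = [0.0] * (zones * zones * directions)
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            gx = (image[r][c + 1] - image[r][c - 1]) / 2.0
            gy = (image[r + 1][c] - image[r - 1][c]) / 2.0
            k, a, k_next, b = split_gradient(gx, gy, directions)
            zone = (r * zones // height) * zones + (c * zones // width)
            features[zone * directions + k] += a
            features[zone * directions + k_next] += b
    return features


if __name__ == "__main__":
    # Toy 10x10 horizontal ramp; a real input would be a digit image.
    ramp = [[float(c) for c in range(10)] for _ in range(10)]
    print(len(zone_direction_features(ramp)))  # 200
```

Splitting each vector between two neighbouring directions, rather than assigning it to the single nearest one, avoids abrupt jumps in the features when a stroke's orientation sits near a bin boundary.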
Sure, that'd be pretty interesting to see....yaroslavvb@gmail.com
What classifier did you use to get 85% accuracy?
Support vector machines with a radial basis function kernel; I used the libsvm library. I'll mail you the code now.
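For reference, the radial basis function (Gaussian) kernel mentioned here compares two feature vectors as exp(-gamma * ||x - y||^2). The sketch below only evaluates that formula in plain Python; it is not the code that was mailed, it does not call libsvm, and the gamma value in the usage lines is an arbitrary placeholder rather than a tuned setting.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


if __name__ == "__main__":
    gamma = 0.5  # placeholder; in practice chosen by cross-validation
    print(rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma))  # 1.0 for identical vectors
    print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma))
```

Identical vectors score 1.0 and the value decays toward 0 as the vectors move apart, with gamma controlling how quickly.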
I am also interested in multi-class SVM with an RBF kernel for OCR.
Can I get the code?
Hi,
I am planning to do my final project (machine learning) at our college on OCR. Can you please give me the code, or guide me on how to do it?
what are the 200 features? how were they generated? Also is there a way to identify the arff line with the image? thanks
can i get the code please