Machine Learning, etc: New Robust OCR dataset

Wednesday, August 05, 2009

New Robust OCR dataset

I collected this dataset for a project that involves automatically reading bibs in pictures of marathons and other races. With about 20k digits, it is larger than the robust-reading dataset of the ICDAR 2003 competition, and it is more uniform because it is digits-only. I believe it is more challenging than the MNIST digit recognition dataset.

I'm now making it publicly available in the hope of stimulating progress on the task of robust OCR. Use it freely, with the only requirement that if you are able to exceed 80% accuracy, you have to let me know ;)

The dataset file contains the raw data (images), as well as a Weka-format ARFF file for a simple set of features.

For completeness I include the MATLAB script used for initial pre-processing and feature extraction, and a Python script to convert its space-separated output into ARFF format. Check "readme.txt" for more details.
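For readers who want a feel for the SSV-to-ARFF step without opening the archive, here is a minimal sketch of such a conversion. It is not the script shipped with the dataset. The function name `ssv_to_arff`, the attribute names `f0`, `f1`, ..., the relation name, and the assumption that the class label is the last column and is a digit 0-9 are all illustrative; check "readme.txt" for the actual layout.

```python
def ssv_to_arff(ssv_lines, relation="bib_digits", n_classes=10):
    """Convert space-separated rows into ARFF text.

    Assumption (not confirmed by the post): each row is
    "feature_1 ... feature_n label", with the label in the last column.
    """
    rows = [line.split() for line in ssv_lines if line.strip()]
    if not rows:
        raise ValueError("no data rows")
    n_features = len(rows[0]) - 1

    out = ["@relation " + relation, ""]
    # One numeric attribute per feature column (names are hypothetical).
    for i in range(n_features):
        out.append("@attribute f%d numeric" % i)
    # Nominal class attribute listing the digit labels 0..n_classes-1.
    out.append("@attribute class {%s}" % ",".join(str(c) for c in range(n_classes)))
    out.append("")
    out.append("@data")
    for row in rows:
        if len(row) != n_features + 1:
            raise ValueError("row has %d columns, expected %d"
                             % (len(row), n_features + 1))
        out.append(",".join(row))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Two tiny made-up rows: two features followed by a label.
    print(ssv_to_arff(["0.5 1.0 3", "0.25 0.0 7"]))
```

To convert a real file, pass an open file handle, for example `ssv_to_arff(open("digits.ssv"))`, and write the returned string to a `.arff` file.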

Dataset

20 comments:

Anonymous, 9:25 AM
hi!

are these numbers correct?

bib_digits_dataset/src$ wc -l digits.ssv
19804 digits.ssv

$ head -1 digits.ssv | wc -w
201

please define the size of training and test data as well as how to split the data exactly.

furthermore, do you mean "1 - error rate" by accuracy? so, the trace of the confusion matrix divided by the number of instances, isn't it?

thanks!
p.

Yaroslav Bulatov, 10:03 AM
There's a total of 19804 instances. Each instance has 200 attributes and 1 class label in the baseline feature extraction scheme included, so numbers are correct.

When I ran my experiments I used the first 15000 instances for training and the rest for testing, getting 3830 correct (79.8% accuracy). The algorithm was Weka's SMO with a Gaussian kernel, using gradient, pixel-level and Zernike moment features. By accuracy I mean the percentage of instances that are classified correctly.
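To make the definition concrete: accuracy in this sense is the sum of the diagonal of the confusion matrix divided by the total number of test instances, which is the same as 1 minus the error rate. A minimal sketch, with a made-up 2-class confusion matrix and a hypothetical function name:

```python
def accuracy_from_confusion(confusion):
    """Fraction of correctly classified instances.

    confusion[i][j] is the number of instances whose true class is i
    and whose predicted class is j, so correct predictions sit on the
    diagonal.
    """
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("empty confusion matrix")
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


if __name__ == "__main__":
    # Toy numbers for illustration only, not results on this dataset.
    toy = [[5, 1],
           [2, 4]]
    acc = accuracy_from_confusion(toy)
    print("accuracy = %.3f, error rate = %.3f" % (acc, 1 - acc))
```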

At the moment I'm uploading an updated tar.gz with labels.txt and labels.test.txt to facilitate this split. The SSV and ARFF files have not been updated, so it's up to the user to regenerate them, or to split them into 15000/4804 manually.

Anonymous, 10:16 AM
thank you for the quick response!

so basically it would suffice to do a

head -n 15000 digits.ssv > digits.train.ssv

tail -n 3830 digits.ssv > digits.test.ssv

did I get you right?

p.

Yaroslav Bulatov, 10:29 AM
almost, except use tail -n 4804
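The same split can be written so the test size never has to be typed by hand: take the first 15000 lines for training and whatever is left for testing. With 19804 lines in total that leaves 19804 - 15000 = 4804 test lines. This is a sketch; the function names and the output file names (borrowed from the commands above) are just placeholders.

```python
def split_train_test(lines, n_train=15000):
    """First n_train lines are training data, the remainder is test data."""
    return lines[:n_train], lines[n_train:]


def split_file(src, train_path, test_path, n_train=15000):
    """Split a line-oriented file (SSV, or the @data part of an ARFF)."""
    with open(src) as f:
        lines = f.readlines()
    train, test = split_train_test(lines, n_train)
    with open(train_path, "w") as f:
        f.writelines(train)
    with open(test_path, "w") as f:
        f.writelines(test)
    return len(train), len(test)


if __name__ == "__main__":
    # Stand-in for the 19804 lines of digits.ssv.
    dummy = ["row %d\n" % i for i in range(19804)]
    train, test = split_train_test(dummy)
    print(len(train), len(test))
```

On the real data this would be called as `split_file("digits.ssv", "digits.train.ssv", "digits.test.ssv")`.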

Anonymous, 5:41 AM
yes, you're completely right, my fault :)

Pete Skomoroch, 1:05 PM
Thanks for pulling this together, great stuff. What license are you releasing this dataset under? Could you release it under the Creative Commons Attribution or the GNU Free Documentation License?

Yaroslav Bulatov, 5:37 PM
Peter -- thanks! I haven't thought about licensing issues, but a quick Google search shows that "public domain" is pretty permissive, so for the lawyers out there -- this dataset is in the public domain

Miran, 3:09 PM
You can get 85% accuracy by using a slightly modified gradient feature extraction algorithm. Each gradient vector is decomposed into a sum of components along the two closest of 8 target directions. The sums of gradient strengths for the 8 directions in 5x5 zones are used as features. I have MATLAB code if you are interested.
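A rough Python sketch of the feature scheme described in this comment follows. It is not Miran's MATLAB code, and several details are assumptions made for illustration: the gradient is a plain central difference with clamped borders (the original may use Sobel or another operator), the image is a list of rows of grayscale values, and the zone grid is obtained by integer division of pixel coordinates. With 8 directions and 5x5 zones the result has 8 x 25 = 200 values.

```python
import math

N_DIRS = 8    # target directions, 45 degrees apart
N_ZONES = 5   # 5x5 grid of zones over the image


def gradient_direction_features(img):
    """Zoned directional gradient features for a 2D grayscale image.

    Each pixel's gradient vector g is written as a*d_k + b*d_(k+1), where
    d_k and d_(k+1) are the two unit target directions closest to g, and
    a and b are accumulated per zone and per direction.
    """
    h, w = len(img), len(img[0])
    step = 2 * math.pi / N_DIRS
    feats = [0.0] * (N_ZONES * N_ZONES * N_DIRS)
    for y in range(h):
        for x in range(w):
            # Central differences, clamped at the image border (assumption).
            gx = (img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]) / 2.0
            gy = (img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]) / 2.0
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            theta = math.atan2(gy, gx) % (2 * math.pi)
            k_raw = int(theta / step)      # lower neighbouring direction
            phi = theta - k_raw * step     # angle past that direction
            k = k_raw % N_DIRS
            # Solve g = a*d_k + b*d_(k+1) for the two components.
            b = mag * math.sin(phi) / math.sin(step)
            a = mag * math.cos(phi) - b * math.cos(step)
            zone = (y * N_ZONES // h) * N_ZONES + (x * N_ZONES // w)
            feats[zone * N_DIRS + k] += a
            feats[zone * N_DIRS + (k + 1) % N_DIRS] += b
    return feats


if __name__ == "__main__":
    # Horizontal intensity ramp: every gradient points along direction 0.
    ramp = [[float(x) for x in range(10)] for _ in range(10)]
    print(len(gradient_direction_features(ramp)))
```

The resulting vector could then be written out one image per line and fed to any classifier, such as the SVM setups mentioned in this thread.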

Yaroslav Bulatov, 3:13 PM
Sure, that'd be pretty interesting to see... yaroslavvb@gmail.com

What classifier did you use to get 85% accuracy?

Miran, 5:13 AM
Support vector machines with a radial basis function kernel; I used the libsvm library. I'll mail you the code now.

Anonymous, 9:52 PM
I am also interested in multi-class SVM with an RBF kernel for OCR.
Can I get the code?

Innovative mind, 7:45 PM
Hi,

I am planning to do my final project (machine learning) at our college on OCR. Can you please give me the code, or guide me on how to do it?

Anonymous, 6:52 PM
What are the 200 features? How were they generated? Also, is there a way to match an ARFF line to its image? Thanks

Unknown, 1:05 PM
Can I get the code, please?