Wednesday, August 05, 2009

New Robust OCR dataset




I collected this dataset for a project that involves automatically reading bibs in pictures of marathons and other races. With about 20k digits, it is larger than the robust-reading dataset of the ICDAR 2003 competition, and it is more uniform because it's digits-only. I believe it is more challenging than the MNIST digit recognition dataset.

I'm now making it publicly available in hopes of stimulating progress on the task of robust OCR. Use it freely, with the only requirement that if you are able to exceed 80% accuracy, you have to let me know ;)

The dataset file contains the raw data (images), as well as a Weka-format ARFF file for a simple set of features.

For completeness I include the Matlab script used for initial pre-processing and feature extraction, and the Python script used to convert its space-separated output into ARFF format. Check "readme.txt" for more details.
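
As a rough illustration of the conversion step, here is a minimal sketch of turning space-separated rows into ARFF. It is not the script shipped in the archive. It assumes each row holds the numeric features followed by the class label as the last field; the function name `ssv_to_arff`, the relation name and the attribute names `f0`, `f1`, ... are made up for this example.

```python
def ssv_to_arff(lines, relation="bib_digits"):
    """Convert space-separated rows (features..., label) into ARFF text.

    Assumption: the class label is the last field of every row.
    """
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        raise ValueError("no data rows")
    n_features = len(rows[0]) - 1
    # The nominal class attribute lists every label seen in the data.
    labels = sorted({row[-1] for row in rows})

    out = ["@relation " + relation, ""]
    for i in range(n_features):
        out.append("@attribute f%d numeric" % i)
    out.append("@attribute class {%s}" % ",".join(labels))
    out.append("")
    out.append("@data")
    for row in rows:
        if len(row) != n_features + 1:
            raise ValueError("row has %d fields, expected %d"
                             % (len(row), n_features + 1))
        out.append(",".join(row))
    return "\n".join(out) + "\n"
```

To use it on the file from the archive, something like `open("digits.arff", "w").write(ssv_to_arff(open("digits.ssv")))` should be enough, provided the field order matches the assumption above.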

Dataset

20 comments:

  1. Anonymous 9:25 AM

    hi!

    are those numbers correct?

    bib_digits_dataset/src$ wc -l digits.ssv
    19804 digits.ssv

    $ head -1 digits.ssv | wc -w
    201

    please define the size of the training and test data, as well as exactly how to split the data.

    furthermore, do you mean "1 - error rate" by accuracy? so, the trace of the confusion matrix divided by the number of instances, isn't it?

    thanks!
    p.

  2. There's a total of 19804 instances. Each instance has 200 attributes and 1 class label in the included baseline feature extraction scheme, so the numbers are correct.

    When I ran my experiments I used the first 15000 instances for training and the rest for testing, getting 3830 correct (79.8% accuracy). The algorithm was Weka's SMO with a Gaussian kernel, using gradient, pixel-level and Zernike moment features. By accuracy I mean the percentage of instances that are classified correctly.

    At the moment I'm uploading an updated tar.gz with labels.txt and labels.test.txt to facilitate this split. The SSV and ARFF files have not been updated, so it's up to the user to regenerate them, or to split them into 15000/4804 manually.

  3. Anonymous 10:16 AM

    thank you for the quick response!

    so basically it would suffice to do a

    head -n 15000 digits.ssv > digits.train.ssv

    tail -n 3830 digits.ssv > digits.test.ssv

    did I get you right?

    p.

  4. almost, except use tail -4804

  5. Anonymous 5:41 AM

    yes, you're completely right, my fault :)

  6. Thanks for pulling this together, great stuff. What license are you releasing this dataset under? Could you release it under the Creative Commons Attribution or the GNU Free Documentation License?

  7. Peter -- thanks! I haven't thought about licensing issues, but a quick google search shows that "public domain" is pretty permissive, so for the lawyers out there -- this dataset is part of the public domain

  8. You get 85% accuracy by using a slightly modified gradient feature extraction algorithm. Each gradient vector is decomposed into a sum of components along the two closest target directions. The sums of gradient strengths for 8 directions in 5x5 zones are used as features. I have Matlab code if you are interested.

  9. Sure, that'd be pretty interesting to see: yaroslavvb@gmail.com

    What classifier did you use to get 85% accuracy?

  10. Support vector machines with a radial basis function kernel; I used the libsvm library. I'll mail you the code now.

  11. Anonymous 9:52 PM

    I am also interested in multi-class SVM with an RBF kernel for OCR.
    Can I get the code?

  12. Hi,

    I am planning to do my final project (machine learning) at our college on OCR. Can you please give me the code, or guide me on how to do it?

  13. Anonymous 6:52 PM

    What are the 200 features? How were they generated? Also, is there a way to match an ARFF line to its image? Thanks

  14. Can I get the code, please?

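
The gradient features described in comment 8 could look roughly like the sketch below. This is not the commenter's Matlab code, which is not included in the thread. The central-difference gradient, the way pixels are assigned to zones, and the names `decompose` and `gradient_features` are all assumptions made for illustration. Only the overall idea comes from the comment: split each gradient vector between its two nearest directions out of 8, then sum the strengths per direction in a 5x5 grid of zones, giving 5 * 5 * 8 = 200 values.

```python
import math

N_DIRS = 8                       # target directions at multiples of 45 degrees
N_ZONES = 5                      # image is divided into a 5x5 grid of zones
STEP = 2 * math.pi / N_DIRS      # angle between neighbouring directions


def decompose(gx, gy):
    """Split gradient (gx, gy) between its two nearest target directions.

    Returns ((k0, a), (k1, b)) such that a * d[k0] + b * d[k1] equals
    (gx, gy), where d[k] is the unit vector at angle k * STEP.
    """
    mag = math.hypot(gx, gy)
    theta = math.atan2(gy, gx) % (2 * math.pi)
    k = int(theta // STEP)
    phi = theta - k * STEP       # offset from the lower direction
    a = mag * math.sin(STEP - phi) / math.sin(STEP)
    b = mag * math.sin(phi) / math.sin(STEP)
    return (k % N_DIRS, a), ((k + 1) % N_DIRS, b)


def gradient_features(img):
    """img: list of rows of grayscale values (assumed layout).

    Returns N_ZONES * N_ZONES * N_DIRS summed gradient strengths.
    """
    h, w = len(img), len(img[0])
    feats = [0.0] * (N_ZONES * N_ZONES * N_DIRS)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Central differences; border pixels are skipped (an assumption).
            gx = (img[y][x + 1] - img[y][x - 1]) / 2.0
            gy = (img[y + 1][x] - img[y - 1][x]) / 2.0
            if gx == 0 and gy == 0:
                continue
            zone = (y * N_ZONES // h) * N_ZONES + (x * N_ZONES // w)
            for k, strength in decompose(gx, gy):
                feats[zone * N_DIRS + k] += strength
    return feats
```

Whether this matches the 200 attributes of the baseline ARFF file is not stated anywhere in the post, so treat the feature count as a coincidence unless the readme says otherwise.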