Wednesday, November 09, 2011

Google1000 dataset

This is a dataset of scans of 1000 public domain books that was released to the public at ICDAR 2007. At the time there was no public serving infrastructure, so few people actually got the 120GB dataset. It has since been hosted on Google Cloud Storage and made available for public download:
  • http://commondatastorage.googleapis.com/books/icdar2007/README.txt
  • http://commondatastorage.googleapis.com/books/icdar2007/[filename]
  • [filename] goes from Volume_0000.zip to Volume_0999.zip

To download it, e.g., with Python on a Linux box:
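import os

for i in range(1000):
    # Trial run: stop after the first two volumes. Remove this line to fetch all 1000.
    if i > 1: break
    fname = 'Volume_%04d.zip' % i
    # -L follows redirects; -o saves the archive under the same name locally.
    cmd = 'curl -L http://commondatastorage.googleapis.com/books/icdar2007/%s -o %s' % (fname, fname)
    os.system(cmd)

The goal of this dataset is to facilitate research into image post-processing and accurate OCR for scanned books.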
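
Since the full set is about 120GB, the download will likely be interrupted at some point. Here is a rough sketch of a resumable variant of the loop above that uses only the standard library instead of curl. It assumes Python 3, and the helper names (volume_name, volume_url, download_volumes) are made up for illustration; only the base URL and the Volume_NNNN.zip naming come from the post.

```python
import os
import urllib.request

# Base URL as given in the post.
BASE_URL = "http://commondatastorage.googleapis.com/books/icdar2007/"


def volume_name(i):
    """Return the archive name for volume i, e.g. Volume_0007.zip."""
    return "Volume_%04d.zip" % i


def volume_url(i):
    """Return the full download URL for volume i."""
    return BASE_URL + volume_name(i)


def download_volumes(indices, dest_dir="."):
    """Download the given volumes, skipping archives that already exist."""
    for i in indices:
        path = os.path.join(dest_dir, volume_name(i))
        if os.path.exists(path):
            continue  # fetched on an earlier run
        # Write to a temporary name first so that an interrupted transfer
        # is never mistaken for a finished archive on the next run.
        tmp_path = path + ".part"
        urllib.request.urlretrieve(volume_url(i), tmp_path)
        os.replace(tmp_path, path)
```

Calling download_volumes(range(2)) would fetch the first two archives into the current directory, and download_volumes(range(1000)) the whole set. Rerunning it after a failure picks up from the first missing volume.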

11 comments:

amit gupta said...

In the Python script, you need to change
if count>1: break
to
if i>1: break
