Wednesday, November 09, 2011

Google1000 dataset

This is a dataset of scans of 1000 public domain books that was released to the public at ICDAR 2007. At the time there was no public serving infrastructure, so few people actually got the 120GB dataset. It has since been hosted on Google Cloud Storage and made available for public download:
  • http://commondatastorage.googleapis.com/books/icdar2007/README.txt
  • http://commondatastorage.googleapis.com/books/icdar2007/[filename]
  • [filename] goes from Volume_0000.zip to Volume_0999.zip

To download it, e.g., with Python on a Linux box:
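import os

# Volumes are named Volume_0000.zip through Volume_0999.zip
for i in range(1000):
    # To test with only the first two volumes, uncomment the next line
    # if i > 1: break
    fname = 'Volume_%04d.zip' % i
    # -L follows redirects, -o writes to the local file name
    cmd = 'curl -L http://commondatastorage.googleapis.com/books/icdar2007/%s -o %s' % (fname, fname)
    os.system(cmd)
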
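Since the full dataset is about 120GB, a download is likely to be interrupted at some point. Here is a minimal sketch of a variant that skips archives already on disk, using only the Python 3 standard library instead of curl. The helper names volume_names and download_volumes, and the ".part" temporary suffix, are my own choices for illustration and are not part of the dataset or its README.

```python
import os
import urllib.request

# Base URL taken from the links above
BASE_URL = "http://commondatastorage.googleapis.com/books/icdar2007/"


def volume_names(start=0, stop=1000):
    """Return archive names Volume_NNNN.zip for indices start..stop-1 (hypothetical helper)."""
    return ["Volume_%04d.zip" % i for i in range(start, stop)]


def download_volumes(names, dest_dir="."):
    """Fetch each archive into dest_dir, skipping ones already present (hypothetical helper)."""
    for name in names:
        target = os.path.join(dest_dir, name)
        if os.path.exists(target):
            # Assumption: a file with the final name was completed by an earlier run
            continue
        # Download under a temporary name so an interrupted transfer
        # is not mistaken for a finished archive on the next run
        partial = target + ".part"
        urllib.request.urlretrieve(BASE_URL + name, partial)
        os.replace(partial, target)
```

Calling download_volumes(volume_names(0, 2)) would try just the first two archives, and download_volumes(volume_names()) would go through all 1000. This only checks whether a file exists, not whether its contents are intact.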
The goal of this dataset is to facilitate research into image post-processing and accurate OCR for scanned books.
