Machine Learning, etc: Google1000 dataset

Wednesday, November 09, 2011

Google1000 dataset

This is a dataset of scans of 1000 public domain books that was released to the public at ICDAR 2007. At the time there was no public serving infrastructure, so few people actually obtained the 120GB dataset. It has since been hosted on Google Cloud Storage and made available for public download:

http://commondatastorage.googleapis.com/books/icdar2007/README.txt
http://commondatastorage.googleapis.com/books/icdar2007/[filename]
where [filename] ranges from Volume_0000.zip to Volume_0999.zip

To download it, e.g., with Python on a Linux box:

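import os
for i in range(1000):
    # Stop after the first two volumes as a trial run; remove this line to fetch all 1000.
    if i > 1: break
    fname = 'Volume_%04d.zip' % i
    # -L follows redirects, -o writes the archive to a local file of the same name
    cmd = 'curl -L http://commondatastorage.googleapis.com/books/icdar2007/%s -o %s' % (fname, fname)
    os.system(cmd)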
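
With 120GB to transfer, it may help to be able to restart an interrupted download. The following is a minimal sketch of the same loop using only the Python standard library instead of curl. The function names, the `fetch` parameter, and the `.part` suffix are my own assumptions for illustration and are not part of the dataset's README.

```python
import os
import urllib.request

BASE_URL = "http://commondatastorage.googleapis.com/books/icdar2007/"

def volume_name(i):
    """Return the archive name for volume i, e.g. Volume_0007.zip."""
    if not 0 <= i <= 999:
        raise ValueError("volume index must be in 0..999")
    return "Volume_%04d.zip" % i

def download_volumes(indices, dest_dir=".", fetch=urllib.request.urlretrieve):
    """Download the given volumes into dest_dir, skipping files already present.

    fetch(url, path) is a hypothetical injection point so the loop can be
    exercised without network access; by default it is urlretrieve.
    Returns the list of paths that were newly downloaded.
    """
    os.makedirs(dest_dir, exist_ok=True)
    fetched = []
    for i in indices:
        name = volume_name(i)
        path = os.path.join(dest_dir, name)
        if os.path.exists(path):
            continue  # finished on an earlier run
        tmp = path + ".part"
        fetch(BASE_URL + name, tmp)
        # Rename only after the transfer completes, so a partial file
        # is never mistaken for a finished archive.
        os.replace(tmp, path)
        fetched.append(path)
    return fetched

# Example: download_volumes(range(2), "google1000") fetches the first two volumes.
```

Rerunning the same call after an interruption skips the archives that already exist and retries only the missing ones.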

The goal of this dataset is to facilitate research into image post-processing and accurate OCR for scanned books.

25 comments:

Unknown said...: In the python script you need to change
if count>1: break
to
if i>1: break; 11:22 PM