Tuesday, October 25, 2011

Google Internship in Vision/ML

My group has intern openings for winter and summer. Winter may be too late (but if you really want winter, ping me and I'll find out feasibility). We use OCR for Google Books, frames from YouTube videos, spam images, unreadable PDFs encountered by the crawler, images from Google's StreetView cameras, Android, and a few other areas. Recognizing individual character candidates is a key step in an OCR system, and one that machines are not very good at. Even with zero context, humans are better. This shall not stand!

For example, when I showed the picture below to my Taiwanese coworker, he immediately said that these were multiple instances of the Chinese character for "one".



Here are 4 of those images close-up. Classical OCR approaches have trouble with these characters.



This is a common problem for high-noise domains like camera pictures and digital text rasterized at low resolution. Some results suggest that techniques from Machine Vision can help.

For low-noise domains like Google Books and broken-PDF indexing, the shortcomings of traditional OCR systems are due to
1) Large number of classes (100k letters in Unicode 6.0)
2) Non-trivial variation within classes
Example of "non-trivial variation"


I found over 100k distinct instances of the digital letter 'A' in just one day's worth of crawled web documents. Some more examples are here.

Chances are that the ideas for a human-level classifier are already out there; they just haven't been implemented and tested in realistic conditions. We need someone with an ML/Vision background to come to Google and implement a great character classifier.
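For a sense of the weakest reasonable baseline, here is a minimal sketch of a nearest-neighbor character classifier on raw pixels. It is illustrative only, not anything we run in production; the load_glyphs helper, the 32x32 glyph size, and the random stand-in data are assumptions made just so the snippet runs.

# Minimal nearest-neighbor character classifier on raw pixel features.
# Illustrative sketch only: load_glyphs() is a hypothetical helper that
# here just fabricates random 32x32 "glyphs" for two classes.
import numpy as np

def load_glyphs():
    rng = np.random.RandomState(0)
    images = rng.rand(200, 32, 32)
    labels = np.repeat([ord('A'), ord('B')], 100)
    return images, labels

def predict(train_x, train_y, image):
    # Flatten each glyph to a pixel vector and return the label of the
    # closest training example under Euclidean distance.
    diffs = train_x.reshape(len(train_x), -1) - image.ravel()
    return train_y[np.argmin((diffs ** 2).sum(axis=1))]

images, labels = load_glyphs()
print(chr(predict(images, labels, images[0])))

Raw-pixel distance falls apart on exactly the kind of noisy, low-resolution glyphs shown above, which is where better features and models should come in.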

You'd have a large impact if your ideas become part of Tesseract. Through books alone, your code would run on books from 42 libraries. And since Tesseract is open source, you'd be contributing to the main OCR effort in the open-source community.

You will get a ton of data, resources, and smart people around you. It's a very low-bureaucracy place. You could run Matlab code on 10k cores if you really wanted to, and I know someone who has launched 200k-core jobs for a personal project. The infrastructure also makes things easier: Google's MapReduce can sort a petabyte of data (10 trillion strings) with 8000 machines in just 30 minutes. Some of the work in our team used features coming from a distributed deep belief network infrastructure.
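To put that sort benchmark in perspective, here is a quick back-of-the-envelope check (my arithmetic on the numbers above, not an official breakdown):

# Back-of-the-envelope check on the petabyte sort figure quoted above.
petabyte = 10 ** 15           # bytes
records = 10 * 10 ** 12       # 10 trillion strings
machines = 8000
seconds = 30 * 60

print(petabyte / records)               # ~100 bytes per string
print(petabyte / machines / seconds)    # ~69 MB/s of input per machine

That works out to roughly 100-byte records and about 69 MB/s of input flowing through each machine.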


To get an internship position, you must pass a general technical screen that I have no control over. If you are interested in more details, you can contact me directly.

The link to apply is here.

4 comments:

Dick Gordon said...

Seems a backwards approach to me: you are ignoring context. Even with the example of Chinese 1s, the observer picks up a few obvious cases and then can conclude that the rest are variants. Context means, for example, that in a string of characters likely to be a word, when non-letters are embedded in the string, they usually need to be replaced by letters that make the string make sense as a word. The sense comes from a dictionary at one level, and at the next level from the word's sentence making grammatical sense.

There is also the challenge, which I heard a librarian express as a dream, of 3D imaging of books at sufficient resolution and contrast to machine read them without opening them and turning the pages. Soft x-ray microbeams (for scatter reduction) might work here, using computed tomography combined with compressive sensing, especially given the binary nature of most text (black ink on white paper). This would speed scanning of books by orders of magnitude, and eliminate the hard to old, brittle books.
Yours, -Dick Gordon gordonr@cc.umanitoba.ca

Dick Gordon said...

“eliminate the hard” should be “eliminate the harm”

Yaroslav Bulatov said...

Context is important too. Google Translate Group is working on that.

Dick Gordon said...

Better work with them, then. OCR one character at a time seems to me a nonstarter. What you can do is loop as follows:

1. recognize a word
2. recognize its characters
3. find best match in dictionary
4. construct an image in the font of the whole word and alternative words
5. find best match of the whole image of the word to the constructed images

Repeat above at sentence level.
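In rough code, one pass of that loop might look like the sketch below; the per-character recognizer, font renderer, and image metric are placeholders, simulated on strings so the example runs on its own.

# Toy, self-contained pass of the word-level loop above. A real system would
# work on pixel images; here "images" are strings and each stage is a
# simulated placeholder.
import difflib

DICTIONARY = ["scanning", "scanner", "spanning", "planning"]

def recognize_characters(word_image):
    # Placeholder per-character OCR: pretend every 'c' is misread as 'e'.
    return word_image.replace("c", "e")

def render_word(word):
    # Placeholder font rendering: identity, since "images" are strings here.
    return word

def image_distance(a, b):
    # Placeholder image metric: 1 minus string similarity.
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

def recognize_word(word_image):
    raw = recognize_characters(word_image)                    # steps 1-2
    candidates = difflib.get_close_matches(raw, DICTIONARY,   # step 3
                                           n=3, cutoff=0.0)
    return min(candidates,                                    # steps 4-5
               key=lambda w: image_distance(word_image, render_word(w)))

print(recognize_word("scanning"))   # recovers "scanning" despite the misread 'c'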

Anyone at Google scanning closed books? Thanks.
Yours, -Dick