Wednesday, March 22, 2006

Challenge: Online Learning for Cell Phone Messaging

If you've used cell phones to send text messages you probably know about their auto-complete feature. For those that haven't -- each digit corresponds to 3 or 4 letters, you enter the digits consistent with your word, and it tries to guess which word you meant. For instance you enter "43556", and it will automatically guess "hello". But if you enter "785" to mean "SVM", it'll probably guess "run", and you'll have to back-up and re-enter the word. The challenge is for the phone to guess the right word.

Since new abbreviations spring up all the time, there's no dictionary that can cover all the words used in text-messages, so an ideal cell-phone will have an algorithm that can adapt to the user. The interesting question is how well can you do?

What makes this domain different from other sources of text is that it's a conversation, consisting of a series of short posts, containing colloquial grammar, bad spelling and abbreviations. I ran some simple algorithms on a similar dataset - 1.5M words of posts from a single person from an online chatroom.

The simplest algorithm is to return the most recent word encountered, consistent with given numbering. You can also return the most frequent word, or have a compound learner that returns one or the other. Cost of 1 is incurred every time an incorrect word is guessed. Here's what the curves look like

The "best rate" curve represents the cost incurred if we only make errors on words that haven't been seen before. After 1.5M words, this "algorithm" makes mistakes about 20% of the time relative to the simple learners, which means there's a lot of room for improvement. And since this "algorithm" only sees 2-3 candidates for each word entered, there's no need to worry about efficient inference.

So here's a challenge -- can you approach the "best rate" curve any closer without using any other sources of text? (like dictionaries) If so, I'd like to hear from you! Here's the simplest script to generate an evaluation like above.

(BTW, I have several GB's of chatlogs, and I'll make more data available if there's demand)


Anonymous said...

Quick recommendation; the algorithm should know when a new message has been started. For example; I added:

class SimpleLearner:
def newmsg():

def evaluate(learner,fn):
for line in open(fn):

Yaroslav said...

That's true, start of the message is useful information.

Yaroslav said...

I realized online learning may be too hard to do right away, so I added more data to test algorithms in the conventional "supervised learning" format, before converting them into online learners

Yang Zhang said...

To Yaroslav:
I'm afraid these drafts are not available until they are published somewhere. Maybe I can provide some materials for you at this moment. The open source spelling program, Aspell, maybe statisy your needs because of its simplicity & efficiency.

Google Talk: