Machine Learning, etc: Challenge: Online Learning for Cell Phone Messaging

Wednesday, March 22, 2006

If you've used a cell phone to send text messages, you probably know about its auto-complete feature. For those who haven't: each digit key corresponds to 3 or 4 letters, you enter the digits consistent with your word, and the phone tries to guess which word you meant. For instance, if you enter "43556", it will automatically guess "hello". But if you enter "786" to mean "SVM", it'll probably guess "run", and you'll have to back up and re-enter the word. The challenge is for the phone to guess the right word.
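The keypad mapping described above can be made concrete with a minimal Python sketch. It assumes the standard 12-key letter layout; the names `KEYPAD`, `LETTER_TO_DIGIT`, and `to_digits` are illustrative and not taken from any phone's software.

```python
# Standard phone keypad layout (assumed): each digit covers 3 or 4 letters.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
# Invert the layout so each letter points at its digit.
LETTER_TO_DIGIT = {ch: d for d, letters in KEYPAD.items() for ch in letters}

def to_digits(word):
    """Return the digit string a user would type for `word` (letters only)."""
    return "".join(LETTER_TO_DIGIT[ch] for ch in word.lower())

print(to_digits("hello"))                    # 43556
print(to_digits("SVM") == to_digits("run"))  # True
```

Because "SVM" and "run" map to the same digit string, the keys alone cannot tell them apart; the phone has to rank the candidates somehow.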

Since new abbreviations spring up all the time, no dictionary can cover all the words used in text messages, so an ideal cell phone will have an algorithm that adapts to the user. The interesting question is: how well can you do?

What makes this domain different from other sources of text is that it's a conversation: a series of short posts containing colloquial grammar, bad spelling, and abbreviations. I ran some simple algorithms on a similar dataset: 1.5M words of posts by a single person in an online chatroom.

The simplest algorithm is to return the most recently encountered word that is consistent with the digits entered. You can also return the most frequent such word, or use a compound learner that returns one or the other. A cost of 1 is incurred every time an incorrect word is guessed. Here's what the curves look like:

The "best rate" curve represents the cost incurred if we only make errors on words that haven't been seen before. After 1.5M words, this "algorithm" makes mistakes about 20% of the time relative to the simple learners, which means there's a lot of room for improvement. And since this "algorithm" only sees 2-3 candidates for each word entered, there's no need to worry about efficient inference.
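The "best rate" curve can be computed without any learner at all: charge a cost of 1 the first time each distinct word appears and nothing afterwards. Below is a minimal sketch, assuming the messages have already been split into lowercase word tokens. The function name `best_rate_curve` is illustrative and does not come from the original script.

```python
def best_rate_curve(words):
    """Cumulative cost when errors occur only on never-before-seen words."""
    seen = set()
    cost = 0
    curve = []
    for word in words:
        if word not in seen:
            # First occurrence of this word: counted as one error.
            cost += 1
            seen.add(word)
        curve.append(cost)
    return curve

print(best_rate_curve(["run", "run", "svm", "run"]))  # [1, 1, 2, 2]
```

Plotting this list against the cumulative cost of a real learner shows how much of that learner's error comes from ambiguity between known words rather than from new vocabulary.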

So here's a challenge -- can you get any closer to the "best rate" curve without using other sources of text (like dictionaries)? If so, I'd like to hear from you! Here's the simplest script to generate an evaluation like the one above.
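The linked script is not reproduced in this text. As a rough idea of what such an evaluation could look like, here is a minimal sketch of the two simple learners and the online cost loop. The class names, the `guess`/`update` interface, and the choice to pass non-letter characters through unchanged are all assumptions made for illustration, not the author's code.

```python
from collections import Counter, defaultdict

KEYPAD = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
          "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
LETTER_TO_DIGIT = {ch: d for d, letters in KEYPAD.items() for ch in letters}

def to_digits(word):
    # Assumption: characters with no key (digits, punctuation) pass through as-is.
    return "".join(LETTER_TO_DIGIT.get(ch, ch) for ch in word.lower())

class MostRecentLearner:
    """Guesses the last word seen with the same digit string."""
    def __init__(self):
        self.last = {}
    def guess(self, digits):
        return self.last.get(digits)  # None if the digits are new
    def update(self, word):
        self.last[to_digits(word)] = word

class MostFrequentLearner:
    """Guesses the word seen most often with the same digit string."""
    def __init__(self):
        self.counts = defaultdict(Counter)
    def guess(self, digits):
        seen = self.counts.get(digits)
        return seen.most_common(1)[0][0] if seen else None
    def update(self, word):
        self.counts[to_digits(word)][word] += 1

def online_cost(learner, words):
    """Guess each word from its digits, pay 1 per wrong guess, then learn it."""
    cost = 0
    for word in words:
        if learner.guess(to_digits(word)) != word:
            cost += 1
        learner.update(word)
    return cost

words = ["run", "run", "svm", "run"]
print(online_cost(MostRecentLearner(), words))    # 3
print(online_cost(MostFrequentLearner(), words))  # 2
```

The learner only sees the true word after committing to a guess, which is what makes the setting online. A compound learner could wrap both classes and pick between their guesses; that part is left out here.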

(BTW, I have several GB of chat logs, and I'll make more data available if there's demand.)

Comments:

Anonymous said...: Quick recommendation: the algorithm should know when a new message has started. For example, I added:

class SimpleLearner:
    ...
    def newmsg(self):
        # Hook called at the start of each message; does nothing by default.
        return
    ...

def evaluate(learner, fn):
    ...
    for line in open(fn):
        learner.newmsg()  # signal the start of a new message
        ...; 11:09 PM
Yaroslav Bulatov said...: That's true, start of the message is useful information.; 11:23 PM
Yaroslav Bulatov said...: I realized online learning may be too hard to do right away, so I added more data for testing algorithms in the conventional "supervised learning" format before converting them into online learners:
http://web.engr.oregonstate.edu/~bulatov/research/reports/autocomplete/describe.html; 10:09 AM
Anonymous said...: To Yaroslav:
I'm afraid these drafts are not available until they are published somewhere. Maybe I can provide some materials for you in the meantime. The open-source spelling program Aspell may satisfy your needs because of its simplicity and efficiency.

MSN: yangzhang_chn@hotmail.com
or
Google Talk: zeddius@gmail.com; 3:00 AM