Machine Learning, etc: 2005

Friday, December 16, 2005

Trends in Machine Learning according to Google Scholar

In a previous post I brought up the 1983 Machine Learning workshop which featured "33 papers", and it was the follow up to the 1980 Machine Learning workshop. By contrast, NIPS 2005 had 28 workshops and is just one of several international annual Machine Learning Conferences. You can see how the field grew by looking at the distribution of publication dates for articles containing phrase "machine learning" indexed by Google Scholar (normalized by total Scholar content for each year)

You can see there's a blip at 1983 when the workshop was held.

Yann LeCun quipped at NIPS closing banquet that people who joined the field in the last 5 years probably never heard of the word "Neural Network". Similar search (normalized by results for "machine learning") reveals a recent downward trend.

You can see a major upward trend starting around 1985 (that's when Yann LeCun and several others independently rediscovered backpropagation algorithm), peaking in 1992, and going downwards from then.

An even greater downward trend is seen when searching for "Expert System",

"Genetic algorithms" seem to have taken off in the 90's, and leveled off somewhat in recent years

On other hand, search for "support vector machine" shows no sign of slowing down

(1995 is when Vapnik and Cortez proposed the algorithm)

Also, "Naive Bayes" seems to be growing without bound

If I were to trust this, I would say that Naive Bayes research the hottest machine learning area right now

"HMM"'s seem to have been losing in share since 1981

(or perhaps people are becoming less likely to write things like "hmm, this result was unexpected"?)

What was the catastrophic even of 1981 that forced such a rapid extinction of HMM's (or hmm's) in scientific literature?

Finally a worrying trend is seen in the search for "artificial stupidity" divided by corresponding hits for "artificial intelligence". The 2000 through 2004 graph shows a definite updward direction.

Tuesday, December 13, 2005

pigeon-level AI

At the NIPS "Towards Human-Level AI" workshop one of the messages was that perhaps we should first try to achieve rat-level AI, and go from there.

But maybe instead we should start by achieving pigeon-level AI. Someone has already measured pigeon performance at discriminating between painting styles

Monday, December 12, 2005

Archaeology

Here's what Aleks Jakulin managed to dig up in Google newsgroup archives

"1983 INTERNATIONAL MACHINE LEARNING WORKSHOP: AN INFORMAL REPORT"
link

Some snippets

Here's Tom Dietterich again: ``I was surprised that you summarized the workshop
in terms of an "incremental" theme. I don't think incremental-ness
is particularly important--especially for expert system work.

"So the language analysis problem has been solved?!?" by Fernando Pereira, Aug 20 1983
link

" What this replacement does to modularity, testability and
reproducibility of results is sadly clear in the large amount of
published "research" in natural language analysis which is untestable
and irreproducible."

Saturday, November 12, 2005

The Key Theorem of the Learning Theory

The Key Theorem of the Learning Theory is what Vapnik calls his theorem (for instance here) that gives the necessary conditions for the Empirical Risk Minimization principle to be consistent.
The important equation is the following:

Here I'll explain the parts of the formula using the following motivating example:
Suppose our true distribution is a normal distribution with unit variance, centered at 0, and we estimate its mean from data using method of maximum likelihood. This falls into domain of Empirical Risk Minimization because it's equivalent to finding unit variance normal distribution which minimizes empirical log-loss. If this method is consistent, then our maximum likelihood estimate of the mean will be correct in the limit

In order for the theorem to apply, we need to do make sure that we can put a bound on the loss incurred by functions in our hypothesis space, so here, make an additional restriction that our fitted normals must have mean between -100 and 100.

Here's what parts of the equation stand for

1.

l is the size of our training set

2.

a is the parameter that indexes our probability distributions, in this case, it's mean of our fitted Normal. Lambda is it's domain, in this case,

This is the actual risk incurred by using estimate a instead of true estimate (0). In our case the loss is log loss, so the risk is

In other words it is the expected value of the negative logarithm of the probability assigned to observations when observations come from Normal centered at 0, but we think they come from Normal centered on a

4.

This is the risk incurred by using estimate a, estimated on a finite sample drawn from the true density. In our case it is the following

Because the sample is a random variable, empirical risk is also a random variable. We can use the method of transformations to find it's distribution. Here are distributions for empirical risk estimated from training sets of size one for different values of a.

You can see that the modes of the distributions correspond to the true risk (a^2), but the means don't.

5.

Supremum is defined as the least upper bound, and behaves lot like maximum. It is more general than the maximum however. Consider the sequence 0.9, 0.99, 0.999, ....
This sequence does not have a maximum but has a supremum.

6.

This is the simplified middle part of the equation. It represents the probability of our empirical risk underestimating true risk by more than epsilon, for a given value of a. This would be used in place of the supremum if we cared about pointwise as opposed to uniform, convergence. Intuitively, we expect this probability to get smaller for any fixed epsilon as the sample size l gets larger. Here are a few densities of risk distribution for a=4 and different values of l

This is the random variable that represents the largest deviation of the empirical risk from corresponding true risk, over all models a. It's like an order statistic, except the samples are not IID, and there are infinitely many of them.

Questions:

How can one find the distribution of \sup(R(a)-R_emp(a) in the setting above?

I came across an interesting article by Zhu and Lu talking about how non-informative priors can often be counter-intuitive. For instance suppose we are estimating parameter of a Bernoulli distribution. If we don't know anything, the most intuitive prior to use is uniform. However Jeffreys' "uninformative" prior doesn't look uniform at all:

Thursday, November 03, 2005

NIPS pre-prints

A google search reveals the following preprints associated with NIPS 2005:

Lafferty, Blei, Correlated Topic Models
Lafferty, Wasserman, Rodeo - Sparse Nonparametric Regression in High Dimensions
Tong Zhang and Rie K. Ando. Analysis of Spectral Kernel Design based Semi-supervised Learning.
Philipp Häfliger, et al, AER Building Blocks for Multi-Layer Multi-Chip Neuromorphic Vision Systems
Paninski, L. - Inferring prior probabilities from Bayes-optimal behavior
Ahrens, M., Huys, Q. & Paninski, L.. Large-scale biophysical parameter estimation in single neurons via constrained linear regression
Jack M. Wang, David J. Fleet, Aaron Hertzmann. Gaussian Process Dynamical Models.
J.-P. Vert, R. Thurman and W. S. Noble, "Kernels for gene regulatory regions",
Firas Hamze and Nando de Freitas. Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
Jason E. Holt, On the Job Training
Marco Cuturi, Kenji Fukumizu - Multiresolution Kernels
Fan Li, Yiming Yang, Eric P. Xing. From lasso regression to feature vector machine.
Jian Zhang, Zoubin Ghahramani and Yiming Yang. Learning Multiple Related Tasks using Latent Independent Component Analysis.
James Diebel, An Application of Markov Random Fields to Range Sensing
A Computational Model of Eye Movements during Object Class Detection, Wei Zhang, Dimitris Samaras, Hyejin Yang, Greg Zelinsky
Cue Integration in Figure/Ground Labeling Xiaofeng Ren, Charless Fowlkes
Nelson, JD; Cottrell, GW; Filimon, F; Sejnowski, T (2005, Dec). Optimal experimental design models of naive human information acquisition. NIPS 2005
W. Zhang, D. Samaras, H. Yang and G. Zelinsky. A Computational Model of Eye Movements during Object Class Detection
Peter Gehler and max Welling Products of "Edge-Perts"
Cesa-Bianchi, Improved risk tail bounds for on-line algorithms
Semi-supervised Learning with Penalized Probabilistic Clustering. NIPS 2005 Zhengdong Lu, Todd Leen
LeCun, et al - Off-Road Obstacle Avoidance through End-to-End Learning
Sparse Covariance Selection via Robust Maximum Likelihood Estimation Authors: Onureena Banerjee, Alexandre d'Aspremont, Laurent El Ghaoui
Glenn Fung, Romer Rosales, Balaji Krishnapuram - Learning Rankings via Convex Hull Separations
Maximum Margin Semi-Supervised Learning for Structured Variables - Y. Altun, D. McAllester, M. Belkin
A Domain Decomposition Method for Fast Manifold Learning. Zhenyue Zhang and Hongyuan Zha
C. Scott and R. Nowak, ``Learning Minimum Volume Sets,"
huys qjm / zemel r / natarajan r / dayan pm - fast population coding
Two view learning: SVM-2K, Theory and Practice" by Jason D. R. Farquhar, David R. Hardoon, Hongying Meng, John Shawe-Taylor and Sandor Szedmak
N. Loeff and A.Sorokin and H. Arora and D.A. Forsyth, ``Efficient Unsupervised Learning for Localization and Detection in Object Categories'', NIPS 2005
T. Roos, , P. Grünwald, P. Myllymäki and H.Tirri. Generalization to Unseen Case
Mooy, J., & Kappen, H.J. Validity estimates for loopy belief propagation on binary real-world networks
Jorge G. Silva, Jorge S. Marques, João M. Lemos, Selecting Landmarkpoints for Sparse manifold Learning
Fleuret, F. and Blanchard, G. - Pattern Recognition from One Example by Chopping
A Probabilistic Approach for Optimizing Spectral Clustering - R. Jin ,C.Ding and F.Kang
Keiji Miura, Masato Okada, Shun-ichi Amari - Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
Blatt D. and Hero A. O., "From weighted classification to policy search"
A Connectionist Model for Constructive Modal Reasoning. A.d'Avila Garcez. Luis C. Lamb and Dov Gabbay
J. Ting, A. D'Souza, K. Yamamoto, T. Yoshioka, D. Hoffman, S. Kakei, L. Sergio, J. Kalaska, M. Kawato, P. Strick and S. Schaal. Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares.
G. L. Li, T.-Y. Leong, Learning Causal Bayesian Network with Constraints from Domain Knowledge
G. L. Li, T.-Y. Leong, Learning Bayesian Networks with Variable Grouping Methods.
L. Itti and P. Baldi. "Bayesian Surprise Attracts Human Attention"
Q-Clustering M. Narasimhan, N. Jojic and J. Bilmes
Nando de Freitas, Yang Wang, Maryam Mahdaviani, Dustin Lang. Fast Krylov Methods for N-Body Learning.
Consistency of One-Class SVM and Related Algorithms, R. Vert and J.-P. Vert
A. J. Bell and L. C. Parra, "Maximising sensitivity in a spiking network,"
R. S. Zemel, Q. J. M. Huys, R. Natarajan, and P. Dayan, "Probabilistic computation
in spiking populations," in Advances in NIPS, 2005
C. Yang, R. Duraiswami and L. Davis: "Efficient Kernel Machines Using the Improved Fast Gauss Transform", NIPS 2005
F. Orabona. Object-based Model of the Visual Attention for Imitation
Jieping Ye, Ravi Janardan, Qi Li. Two-Dimensional Linear Discriminant Analysis
L. Song, E. Gordon, and E. Gysels, "Phase Synchrony Rates for the Recognition of Motor Imageries in BCIs"
Kakade, S., M. Seeger and D. Foster: Worst-Case Bounds for Gaussian Process Models.
Shen, Y., A. Ng and M. Seeger: Fast Gaussian Process Regression using KD-Trees.
Kevin J. Lang - Fixing two weakness of the Spectral Method
Kari Torkkola and Eugene Tuv. Ecumenical kernels from random forests
C. Molter and U. Salihoglu and H. Bersini, Storing information through complex dynamics in Recurrent Neural Networks
J. Ting, A. D'Souza, K. Yamamoto, T. Yoshioka, D. Hoffman, S. Kakei, L. Sergio, J. Kalaska, M. Kawato, P. Strick and S. Schaal. Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares
Z. Nenadic, D.S. Rizzuto, R.A. Andersen, and J.W. Burdick, "Discriminat basedfeature selection with information theoretic objective
Mooy, J., & Kappen, H.J. (2004). Validity estimates for loopy belief propagation on binary real-world networks
Lackey, J. and Colagrosso, M (2005). A Kernel Method for Polychotomous Classification
J. A. Palmer, K. Kreutz-Delgado, D. P. Wipf, and B. D. Rao, Variational EM Algorithms for Non-Gaussian Latent Variable Models
Singh, S., Barto, A., and Chentanez, N. (2005). Intrinsically motivated reinforcement learning
Yun-Gang Zhang, Changshui Zhang, Separation of Music Signals by Harmonic Structure Modeling.
Rob Powers, Yoav Shoham, New Criteria and a New Algorithm for Learning in Multi-Agent Systems
Wood, F., Roth, S., and Black, M. J., "Modeling neural population spiking activity with Gibbs distributions,
Hinton, G. E. and Nair, V. Inferring motor programs from images of handwritten digits.
Onureena Banerjee, Alexandre d'Aspremont, Laurent El Ghaoui - Sparse Covariance Selection via Robust Maximum Likelihood Estimation.
"Measuring Shared Information and Coordinated Activity in Neuronal Networks." K. Klinkner, C. Shalizi, and M. Camperi, NIPS, 2005.
Generalisation Error Bounds for Classifiers Trained with Interdependent Data Usunier N., Amini M.-R., Gallinari P
Brafman, R. I. and Shani, G. "Resolving perceptual asliasing with noisy sensors"
Griffiths, T.L., and Ghahramani, Z. (to appear) Infinite Latent Feature Models and the Indian Buffet Process
Zhang, J., Ghahramani, Z. and Yang, Y. (to appear) Learning Multiple Related Tasks using Latent Independent Component Analysis.
Murray, I., MacKay, D.J.C., Ghahramani, Z. and Skilling, J. (to appear) Nested Sampling for Potts Models.
Snelson, E. and Ghahramani, Z. (to appear) Sparse Parametric Gaussian Processes
Ghahramani, Z. and Heller, K.A. (to appear) Bayesian Sets
On the Convergence of Eigenspaces in Kernel Principal Component Analysis: G. Blanchard and L. Zwald
Poupart, P. and Boutilier, C. VDCBPI: an Approximate Scalable Algorithm for Large Scale POMDPs
T. Murayama and P. Davis, Rate distortion codes for sensor networks: A system-level analysis,
Damian Eads, Karen Glocer, Simon Perkins, and James Theiler (2005). Grammar-guided feature extraction for time series classification.
Doi, E., Balcan, D. C., & Lewicki, M. S. A theoretical analysis of robust coding over noisy overcomplete channels
Nadler, Boaz; Lafon, Stephane; Coifman, Ronald R.; Kevrekidis, Ioannis G. - Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck operators
"Abstractions and Hierarchies for Learning and Planning" Lihong Li
"Walk-sum interpretation and analysis of Gaussian belief propagation ," Jason K. Johnson, Dmitry M. Malioutov, and Alan S. Willsky
Baker C., Tenenbaum J., & Saxe R. - Bayesian models of perceiving.intentional action
Tensor Subspace Analysis - Xiaofei He, Deng Cai, and Partha Niyogi
Laplacian Score for Feature Selection - Xiaofei He, Deng Cai, and Partha Niyogi.
Shunji Satoh ``Long-Range Horizontal Connections in V1 Serve to Multi-Scale Image Reconstruction
"Faster Rates in Regression via Active Learning" Rebecca Willett with R. Castro and R. Nowak,
Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification - Ashish Kapoor, Yuan (Alan) Qi, Hyungil Ahn and Rosalind W. Picard
Maxim Raginsky, Svetlana Lazebnik - Estimation of intrinsic dimensionality using high-rate vector quantization
Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer - The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
Patrick Flaherty, Michael Jordan, Adam Arkin - Robust Design of Biological Experiments
Yael Niv, Nathaniel Daw, Peter Dayan - How Fast to Work: Response Vigor, Motivation and Tonic Dopamine
Learning from Data of Variable Quality - Koby Crammer, Michael Kearns, and Jennifer Wortman
R. Kondor and T. Jebara - Gaussian and Wishart Hyperkernels
L. Liao, D. Fox, and H. Kautz. - Location-Based Activity Recognition Reinforcement Learning of Local Shape in the Game of Atari-Go David Silver, Richard Sutton, Martin Müller, Markus Enzenberger

Sunday, October 16, 2005

Commercial article databases and indexes

Full texts of recent machine learning papers are usually freely available on the web (ie, google). That's often not the case for related fields like statistics/math/econometrics, so one has to rely on some indexing service that links into commercial article databases. There are hundreds of such indexing services. I tend to use the following indexes, which I think have the highest recall for ML/statistics stuff: MathSciNet, JSTOR and Google Scholar.

Here are the numbers of results for 3 queries, searching for exact phrase in title:

"exponential families":
Google Scholar -- 635
mathscinet -- 660
JSTOR -- 147

"support vector":
Google Scholar -- 3800
mathscinet -- 131
JSTOR -- 1

"cross validation":
Google Scholar -- 1530
MathSciNet -- 181
JSTOR -- 53

JSTOR has full texts for all articles it indexes. In my experience, MathSciNet had full text links to about half the articles. Google Scholar is probably lowest precision search because it includes duplicates and documents not published anywhere besides the Web.

Questions:
What are other good indexing services?

Saturday, October 15, 2005

Learning Mathematica

Here are some useful Mathematica training notebooks I came across.

Mathematica Training -- Notebooks from a 2 day Mathematica course
NKS summer school -- some notebooks from the "New Kind of Science" summer school intro to Mathematica
Programming Paradigms via Mathematica - notebooks from a course developed by Neidinger and Swallow

Thursday, March 10, 2005

The joy of scanning

I found a book scanner in our library's basement and decided to put it to good use by scanning some hard-to-find-online references on foundations of Bayesianism.

"Algebra of Probable Inference" by Cox, 1961 (aka, Why everyone should be a Bayesian). Demonstrates a functional derivation of probability theory as the unique extension of Boolean Algebra.
"Why I'm not a Bayesian" by Clark Glymour, Theory and Evidence, 1981. Criticizes Bayesian approach from the philosophy of science point of view.
"Why Glymour is a Bayesian" by R Rosenktrantz, Testing Scientific Theories, 1983
"Why isn't everyone a Bayesian?" by Efron B, American Statistician 1986. Examines reasons why not everybody was a Bayesian, as of 1986, with scorching reply from Lindley.
"Axioms of Maximum Entropy" by Skilling, MaxEnt 1988 proceedings. Sets up four practically motivated axioms, and uses them to derive maximum entropy as the unique method for picking a single probability distribution from the set of valid probability distributions.

I included the last one because there are some parallels with philosophy of Bayesianism. Following Cox, one should be a Bayesian if

They assume boolean logic
Can encode their true belief as a pdf

On other hand, following Skilling, one should use ME principle if they

Believe Skilling's axioms.
Have statements about true distribution in the form of constraints.

In both models assumption number 1 sounds more or less reasonable, whereas assumption number 2 causes discontent. With Bayesian approach, we don't know how to represent our prior knowledge as a pdf, whereas with ME approach, we don't know where to get the constraints from. However, in practice, we can often find a pdf that is close to representing the true belief of the expert, and similarly we can often find constraints that approximately rule out unfeasible distributions.

Questions

The references are over 20 years old. One newer one from 2000 by Berger looks at rising popularity of "objective" Bayesian and robust Bayesian approaches, and predicts practical Bayesianism of the future to contain both frequentist and traditional Bayesian elements. Does anyone know of more up-to-date overviews of different inductive reasoning methods?

BTW, if you are the author, and don't like the links to your work, let me know, and I'll remove them

Saturday, March 05, 2005

More on CiteULike

I've noticed some people I know starting to use CiteULike recently. Services like CiteULike are generally a good development because they increase efficiency of research: usually people share bibliography through bibliography section of their published articles, but a publication can take years to become accessible.

The immediate advantage of CiteULike is that it can fill Bib details in for you. Here are some CiteULike usage tips

Make sure you have "Post to Citeulike" button on toolbar
When adding article X, first search for X on scholar.google.com
If search results have a link to ACM, IEEE, JSTOR, Ingenta or some other supported collection you are in luck -- go there, click "Post to Citeulike" and it'll automatically extract the bibliographic information for you
If you are adding books, search for book on Amazon, and do the same procedure
For off-campus access, save electronic book/article copies to a web accessible directory, and add link to (it's only seen by you) to the book/article entry
Add RSS of new submissions with relevant tags from CiteULike to your RSS aggregator
Choose usernames such that people can find you through Google (ie, I use yaroslavvb, which google knows about)

Questions:

Any other efficiency tips?

Friday, March 04, 2005

Machine Learning journals

Here are some journals to keep an eye on, along with their RSS feeds. I picked out the list by seeing where my favourite Machine Learning/stats papers came from:

Journal of Machine Learning Research (web, rss) The journal for machine learning publications. A better and freer replacement to "Machine Learning" journal -- here's some history
IEEE Transactions on Pattern Analysis and Machine Intelligence (web, rss) -- pattern means "visual pattern"
Machine Learning journal (web, rss) -- seems more applied than JMLR
Studies in History and Philosophy of Science Part B: Modern Physics (web, rss) -- sometimes discussions of Bayesianism, Maxent
Neural Computation(web, rss) -- neural nets, neurobiology, and general machine learning
ARXIV submissions on Learning (rss) and on Information Theory (rss)
Journal of the Royal Statistical Society (web, rss) -- some of the best papers on estimation were published here
Physics Review Letters (web) -- sometimes Machine Learning related stuff
The Annals of Statistics (web) -- Czisar published here
SIAM applied math (web)
Journal of Econometrics (web, rss) -- often the same problems but with a different name
Proceedings of the Nat'l Academy of Sciences(web) -- sometimes have interesting papers in applied math/computer science/statistics sections

CWI journal club has a bigger list of journals and links to their websites, mostly AI (not ML) oriented -- http://homepages.cwi.nl/~tomas/journalclub/20050301.html

Watch out for information overload :O

Questions:

What are some other journals that Machine Learning researchers should know about?

Monday, February 28, 2005

Blogs of Note

From Geomblog -- "Machine learning blogs appear to be popping up faster than you can say 'expectation-maximization'". With that note, here are a few machine learning (and related) blogs I came across recently

Technical

Machine Learning -- Eric Altendorf is a grad student doing research on Knowledge Intensive learning here at OSU
Machine Learning -- Josek is a PhD student working on Bayesian Learning
Alpha and Beta -- (two stats grad students?)

Non-Technical

BMI students -- a group of students in a graduate BioInformatics program
Math4Poets -- math history
Treadmill to Infinity -- AI researcher at MIT
Sam Reid's blog -- grad student at U.Colorado, does Machine Learning research
Yin Zhangqi's Blog -- physics grad student. It's mostly in Chinese, but with google's translator, and good deal of imagination, you can read some things

Saturday, February 26, 2005

Knowledge Intensive Learning

When prior knowledge is given in the form of a prior, Bayesian method is the rational way of making decisions, based on Cox' arguments. However, prior knowledge doesn't usually come in a form of a prior. For instance an expert may say that X and Y are independent, or that increasing value of X will influence Y to increase. We need to figure out how to use those kinds of knowledge to bias our learners, and this is where Knowledge Intensive Learning comes in. Eric Altendorf does research in the area and is doing a series of posts on it in his new blog.

Friday, February 25, 2005

Naive Bayes to fight HIV?

Quoting -- "Microsoft scientists David Heckerman and Nebojsa Jojic were the first to pioneer the medical uses of anti-spam software. They have joined up with doctors and scientists from the University of Washington and Australia's Royal Perth Hospital to build more potent vaccines based on the data gathered by Microsoft's technologies."

article

Thursday, February 24, 2005

How to browse the web while pretending to do Machine Learning research

Browse ML-related tags on Del.icio.us (http://del.icio.us/tag/machinelearning)
Find people who bookmarked JMLR and browse their bookmarks/inbox (ie http://del.icio.us/dnaboy76)
Search for "Machine Learning" on technoratid (search)
Find people who subscribed to John Langford's blog through bloglines, and browse their subscriptions, (ie http://bloglines.com/public/Suresh)
Find people who entered ML papers in Citeulike, and browse their library (ie http://www.citeulike.org/user/brian)
At some point, get back to wor k ;)

Questions:

What are some other ways of finding interesting stuff?

Tuesday, February 22, 2005

CiteULike

I came across CiteULike on Seb's Open Research blog a week ago, tried it out, and it seems very useful. Basically it's a cross between bibliography management software like JafRef, and social bookmarking like del.icio.us. You can use it to keep track of the papers you find, and export in the BibTex format. Then you can see what other people added given paper, and see what papers they are reading. You can also add comments to papers and view other people's comments (not many at this point). Suresh has a more detailed review.

Resources:

http://www.citeulike.org/
CiteULike FAQ
Papers I've added to it recently

Thursday, February 17, 2005

Machine Learning reading groups

If you are looking for ideas for things to read, you should check out what other Machine Learning Reading Groups are reading. I compiled the following list by searching google/asking people.

This list is not very up to date. This is a permalink so feel free to link to it. Please let me know through email, or comment section about other up-to-date MLRGs you know (site must list papers)

Current MLRGs
(as of Feb 19th 2005)

Oregon State U (Roweis, Tishby, Jordan, Collins, Caticha)-- mlrg
Gatsby (McKay, ARD, LDA) -- mlrg
Humboldt-Universitat zu Berlin (ICML, KDD papers)-- mlrg
TTIC Chicago list of readings (Langford, Kakade, McAllester)seminars
University of New Mexico (Jaakkola, Mooney, Scholkopf) -- mlrg
Oregon Health and Science U. (Seeger, Geman, MacKay) -- mlrg
Washington University (Weiss, Srebro) -- mlrg
MIT (Neal, Hofmann, Tishby) -- mlrg
University of California, Irvine (Blei, Jordan, Welling) -- mlrg
University of Amsterdam (Manning, Koller, Collins, Valiant) mlrg
Hebrew University (IB, SVM, Rademacher, Chechnik) mlrg
Cornell (Jordan, Schuurmans,Cristianini) -- mlrg
South Carolina (Kakade, Dresner, Feigenbaum) mlrg
U. Alberta (Sutton, RL, MaxEnt)-- mlrg
Weizemann Institute, (vision oriented, Jordan, Buntine) -- mlrg
Helsinki U of Tech (mostly neuroscience, Sejnowski) -- mlrg
University of London (vision oriented) -- mlrg
Rutgers (Michael Jordan's notes) -- mlrg
UMass (vision oriented, ICA, kernel PCA) -- mlrg
Edinburgh (NLP group, Hoffman, Manning,Collins) -- mlrg
U. of California, San Diego (Tishby, Lafferty, Joachims, Herbrich) -- mlrg
UTexas (Aggarwal, Altun, Manning) -- mlrg

Seminars,Journal Clubs,slightly outdated MLRGs

Columbia (November) -- mlrg
U.Toronto (July)-- mlrg
Tufts University (December) -- mlrg
Australian National U. (December)-- mlrg
Centre Universitaire d'Informatique (June) --journal club
Microsoft Research (Herbrich, large-margin, factor graphs, kernels) -- mlrg
CWI Journal Club (Netherlands) journal club
Seminars at NYU (Langford, Saul) seminars
Graphical Models Reading Group at CMU (about 10 months old) mlrg

Who is being read?
Below is the list of most popular authors ranked by how many groups in the "current MLRGs" section had them listed:

9 - Koller
8 - Jordan
7 - Weiss
6 - Taskar, Saul, McCallum, Lafferty, Ghahramani, Friedman, Collins
5 - Roweis, Pereira, Ng, Domingos
4 - Zhu, Tishby, Russell, Roth, Li, Kearns, Jaakkola, Guestrin, Blei
3 - Zhou, Yair, Wong, Wellner, Tenenbaum, Slonim, Singh, Singer, Scholkopf, Schapire, Mooney, Mansour, Manning, MacKay, Joachims, Hofmann, Griffiths, Freund, Freeman, Dengyong, Breiman, Bartlett, Altun

Questions:

What other reading groups have websites?
What are other ways of finding interesting papers?

Category: MachineLearning

Saturday, February 12, 2005

Tensor Methods for Optimization

Second-order methods use an exact Hessian, or a compact approximation to direct the search for the minimum. They are called second-order because the Hessian they use, ie, second derivative, is the second order term in the Taylor expansion of the function. What if our function is not a quadratic, and we wanted to guide our minimization using some 3rd or 4th order information? That's where tensor methods come in.

Description

Main idea to Schnabel's tensor method is to select a set of p points from previous evaluations such that they are mostly linearly independent (each new direction makes at least a 45 degree angle with old direction). The idea is then to try to interpolate function values at those points by fitting a 4th order model with special form of 3rd and 4th order terms. Because of the restricted form of these terms, the overhead associated with this step is not significant over the standard Newton-Gauss algorithm. Because it approximates function with 4th order expansion, the number of problems for which this method converges was larger. Schnabel also proves superlinear convergence of this method for some cases when Newton's method converges linearly.

Performance

Seth Greenblatt tried tensor method on some econometric models, and results were 1.5-3 times reduction in number of function evaluations. He compared it against Newton-Rhapson method, using exact matrix inversion (O(n^3)) and the extra overhead of for the tensor method wasn't significant. The problems he tested on were econometric models including Lucas-Rapping model and Berndt-Wood KLEM model. Lucas-Rapping model in his thesis was basically logistic regression with some econometric semantics.

Schnabel's paper in SIAM J Opt also used explicit Hessian, and O(n^3) steps to invert it. His 1998 paper introduces implementation of tensor method, TenSolve, which uses same approach as Gauss-Newton to approximate second order term from gradients, which means O(n^2) space and time requirement. Asymptotic overhead of using tensors was O(n^1.5)

Thoughts

TenSolve stores the approximation to the Hessian explicitly, so that takes O(n^2) space. It seems conceptually possible to reduce storage to O(n) if we use a method analogous to limited memory BFGS to store the Hessian. If that worked, it would result in a tensor optimization method with O(n) storage complexity and O(n^1.5) computation complexity

What are some applications of Machine Learning where an O(n^1.5) method would be acceptable?
What would be a good dataset on which to test CRF training using this method?

Resources:

Schnabel, Robert B and Ta-Tung Chow (1991) "Tensor Methods for Unconstrained Optimization Using Second Derivatives," SAIM J. Optimization 1,3,293-315
Ali Bouaricha, Robert B. Schnabel: Algorithm 768: TENSOLVE: A Software Package for Solving Systems of Nonlinear Equations and Nonlinear Least-Squares Problems Using Tensor Methods. ACM Trans. Math. Softw. 23 (2): 174-195 (1997)
Greenblatt, "Tensor Methods for Full-Information Maximum Likelihood Estimation", PhD thesis, George Washington University
TenSolve (Implementation of Schnabel's tensor optimization method): http://www.netlib.org/toms/768

Friday, February 11, 2005

Uninformative Priors

This prior seems to favour values of theta close to 0 or 1, so how can it be uninformative?

The intuition they provide is the following: think of Maximum Likelihood estimator as the most uninformative estimate of parameters. Then we can rank informativeness of priors by how closely the Bayesian Posterior Mean estimate comes to the maximum likelihood estimate.

If we observed k heads out of n tosses, we get following estimates:

1. MLE: k/n
2. PM using Jeffrey's prior (k+1/2)/(n+1)
3. Laplace smoothing (Uniform) (k+1)/(n+2)
4. PM using Beta(a,b) prior (k+a)/(n+a+b)

You can see that using this informal criteria for informativeness, Jeffreys' prior is more uninformative than Uniform, and Beta(0,0) (the limiting distribution) is the least informative of all, because then Posterior Mean estimate will coincide with the MLE.

Resources:

Zhu, Lu, The Counter-intuitive Non-informative Prior for the Bernoulli Family, Journal of Statistics Education Volume 12, Number 2 (2004), http://www.amstat.org/publications/jse/v12n2/zhu.pdf
Kass, Wasserman, "The Selection of Prior Distributions by Formal Rules", JASA 96
Anne Randi Syversveen. "Noninformative Bayesian priors. Interpretation and problems with construction and applications" (1998), unpublished http://www.math.ntnu.no/preprint/statistics/1998/S3-1998.ps

Wednesday, February 09, 2005

Why I started this blog

I've been doing research in Machine Learning (Conditional Random Fields and related) for the last year. This is a place for me to organize my thoughts and write down interesting new things I come across (but feel free to add comments). Some of the things I may be adding:

Comments on papers
Mini-tutorials on ML-related topics
Links to online Machine Learning resources

It won't be linear in nature as I'll revise old posts when relevant information comes up. Maybe in the future I'll switch to a setup similar to Cosma Shalizi's notebooks, but at this point, blog is the most practical thing to do.

Resources:

Cosma Shalizi's notebooks http://www.cscs.umich.edu/~crshalizi/notebooks/

Monday, February 07, 2005

Lagrange multipliers: equality constraints

Constrained optimization often comes up in machine learning.

Finding lengths of best instantaneous code when distribution of source is known
Finding maximum entropy distribution subject to given constraints
Minimizing distortion specified by the amount of information X provides about Y (Tishby's, Information Bottleneck method)

All of the examples above can be solved using the method of Lagrange multipliers. To understand motivation for Lagrange multipliers with equality constraints, you should check out Dan Klein's excellent tutorial.

However, you don't have to understand them to be able to go through the math. Here's how you can do it in 5 easy steps:

Write down the thing you are optimizing as J(x)

Write down your constraints as g_i(x)=0
Define L(x,l1,l2,...)=J(x) + l1 g1(x) + l2 g2(x) +...
Differentiate L with respect to each variable and set to 0
Solve this system of equations

Small Example

Suppose you want to maximize

subject to

Your problem looks like this:

(black line is where you solution must lie)

Then our only constraint function is g(x,y)=x-y+1=0. Our L (Langrangian) then becomes

differentiating and setting to 0 we get

Solve this system of 3 equations by substitution and you get x=-0.5, y=0.5, lambda=-1. The lagrange multiplier lambda_i represents ratio of the gradients of objective function J and i'th constraint function g_i at the solution point (that makes sense because they point in the same direction)

Bigger Example

For a more realistic example, consider maximizing entropy over a set of discrete distributions over positive integers where the first moment is fixed. Since we have an infinite number of possible X's, differentiating the Lagrangian gives us an countably infinite system of equations. But that's not actually a problem (unless you try to write them all down). Here's the detailed derivation

http://yaroslavvb.com/research/reports/fuzzy/fuzzy_lagrange.ps

So you can see that highest entropy discrete distribution with fixed mean is geometric distribution. We can solve the same problem for continuous X's in the positive domain, and get exponential distribution. If we allow X's to span whole reals, there will be no solution for the same reason as why "uniform distribution over all reals" doesn't exist.

Questions:

What to use for complicated 3d graphs? I tried Mathematica, but not sure how to combine two 3d graphs into one
Can anyone do Exercise 1. in Chapter 5 of Christianini's SVM book?
If we use Robinson's non-standard Analysis, could we sensibly define "uniform distribution over all integers"?

Resources:

Dan Klein's "Lagrange Multipliers without Permanent Scarring" http://www-diglib.stanford.edu/~klein/lagrange-multipliers.pdf
Chapter 5 of N. Cristianini and J. Shawe-Taylor Support-Vector Machines book http://www.support-vector.net/chapter_5.html
Chapter 5 of Stephen Boyd and Lieven Vandenberghe "Convex Optimization" http://www.stanford.edu/~boyd/cvxbook.html
Bertsekas, Dimitri P. "Constrained optimization and Lagrange multiplier methods" ISBN: 0120934809

Saturday, February 05, 2005

Interesting blogs, newsgroups, chatrooms

Here are some things that someone who likes math, Machine Learning or statistics may find interesting

Technical:

John Langford's blog -- http://hunch.net/ , high quality technical posts on Machine Learning
Lance Fortnow's blog -- http://weblog.fortnow.com/ , Computational Complexity and related
Notebooks -- http://www.cscs.umich.edu/~crshalizi/notebooks/ , a collection of resources from Cosma Shalizi
Illinois Genetic Algorithms Laboratory (IlliGAL) -- http://illigal.blogspot.com/
AI-complete -- http://ai-complete.blogspot.com/ low-traffic, but high quality AI blog
Neurodudes -- http://www.neurodudes.com/ Neuro-biological standpoint
Statistical Modeling, Causal Inference, and Social Science -- http://www.stat.columbia.edu/~cook/movabletype/mlm/

non-Technical:

AI Knowledge (http://www.karmachakra.com/aiknowledge) -- non-technical AI-related
Ernie's 3D Pancakes -- http://3dpancakes.typepad.com/ernie/ Jeff is a "computational geometer"
"Occasional thoughts by physicist Michael Nielsen" -- http://www.qinfo.org/people/nielsen/blog/
Ferndando Pereira's blog -- http://radio.weblogs.com/0100167/
Language log -- http://itre.cis.upenn.edu/~myl/languagelog/
Hanna Wallach is doing a PhD in Machine Learning -- http://www.srcf.ucam.org/~hmw26/join-the-dots/
The Alpha and Omega -- http://mahalanobis.twoday.net/ interesting snippets from an economist
Thesilog -- http://www.sologen.net/thesilog/ Amir does research in control theory

Newsgroups:

comp.lang.ai
comp.lang.ai.neural-nets
sci.stat.math
Machine Learning (http://groups-beta.google.com/group/Machine-Learning)

Mailing Lists:

Connectionists (calls for papers, jobs): http://www.cnbc.cmu.edu/Resources/connectionists.shtml
COLT mailing list (more calls for papers): http://mail.cs.uiuc.edu/mailman/listinfo/colt
UAI mailing list (if you can get on it): http://www.auai.org/

Chatrooms:

irc.freenode.org #MachineLearning

Finding Blogs:

After looking at "Referring Link" stats of people browsing this blog, I discovered two new ways of finding relevant blogs

Go to http://blo.gs and search for "Machine Learning"
Go to http://www.technorati.com/, search for Machine Learning related links, and it'll give you other blogs that have that link

I found a few people who came to this blog after searching for mentions of illigal.blogspot.com , and their IP's traced to Illinois, I wonder why ;)

Feel free to post other cool blogs, or other methods of finding them

Thursday, February 03, 2005

Getting probabilities from students

A comment on Computational Complexity blog blog inspired me to look at the following problem

Suppose you give true/false test to students and want them to put down their subjective probabilities that answer is true/false. To make them honest, you would want to award points in such a way, that in order to maximize their expected points, the student would need to report their correct probabilities. Which utility function to use? It turns out, f(x)=A log x + B is one such family. Here's the derivation

Questions:
1) Does any other such utility function exist?
2) What are other fun ways to torture students?

Here are graphs of three possible functions that can be used for eliciting probabilities from students. They are all concave, maybe one can prove any suitable function must be concave?