Friday, December 16, 2005

Trends in Machine Learning according to Google Scholar

In a previous post I brought up the 1983 Machine Learning workshop which featured "33 papers", and it was the follow up to the 1980 Machine Learning workshop. By contrast, NIPS 2005 had 28 workshops and is just one of several international annual Machine Learning Conferences. You can see how the field grew by looking at the distribution of publication dates for articles containing phrase "machine learning" indexed by Google Scholar (normalized by total Scholar content for each year)
You can see there's a blip at 1983 when the workshop was held.

Yann LeCun quipped at NIPS closing banquet that people who joined the field in the last 5 years probably never heard of the word "Neural Network". Similar search (normalized by results for "machine learning") reveals a recent downward trend.

You can see a major upward trend starting around 1985 (that's when Yann LeCun and several others independently rediscovered backpropagation algorithm), peaking in 1992, and going downwards from then.

An even greater downward trend is seen when searching for "Expert System",

"Genetic algorithms" seem to have taken off in the 90's, and leveled off somewhat in recent years

On other hand, search for "support vector machine" shows no sign of slowing down

(1995 is when Vapnik and Cortez proposed the algorithm)

Also, "Naive Bayes" seems to be growing without bound

If I were to trust this, I would say that Naive Bayes research the hottest machine learning area right now

"HMM"'s seem to have been losing in share since 1981

(or perhaps people are becoming less likely to write things like "hmm, this result was unexpected"?)

What was the catastrophic even of 1981 that forced such a rapid extinction of HMM's (or hmm's) in scientific literature?

Finally a worrying trend is seen in the search for "artificial stupidity" divided by corresponding hits for "artificial intelligence". The 2000 through 2004 graph shows a definite updward direction.

Tuesday, December 13, 2005

pigeon-level AI

At the NIPS "Towards Human-Level AI" workshop one of the messages was that perhaps we should first try to achieve rat-level AI, and go from there.

But maybe instead we should start by achieving pigeon-level AI. Someone has already measured pigeon performance at discriminating between painting styles

Monday, December 12, 2005


Here's what Aleks Jakulin managed to dig up in Google newsgroup archives


Some snippets

Here's Tom Dietterich again: ``I was surprised that you summarized the workshop
in terms of an "incremental" theme. I don't think incremental-ness
is particularly important--especially for expert system work.

"So the language analysis problem has been solved?!?" by Fernando Pereira, Aug 20 1983

" What this replacement does to modularity, testability and
reproducibility of results is sadly clear in the large amount of
published "research" in natural language analysis which is untestable
and irreproducible."

Saturday, November 12, 2005

The Key Theorem of the Learning Theory

The Key Theorem of the Learning Theory is what Vapnik calls his theorem (for instance here) that gives the necessary conditions for the Empirical Risk Minimization principle to be consistent.
The important equation is the following:

Here I'll explain the parts of the formula using the following motivating example:
Suppose our true distribution is a normal distribution with unit variance, centered at 0, and we estimate its mean from data using method of maximum likelihood. This falls into domain of Empirical Risk Minimization because it's equivalent to finding unit variance normal distribution which minimizes empirical log-loss. If this method is consistent, then our maximum likelihood estimate of the mean will be correct in the limit

In order for the theorem to apply, we need to do make sure that we can put a bound on the loss incurred by functions in our hypothesis space, so here, make an additional restriction that our fitted normals must have mean between -100 and 100.

Here's what parts of the equation stand for


l is the size of our training set


a is the parameter that indexes our probability distributions, in this case, it's mean of our fitted Normal. Lambda is it's domain, in this case,


This is the actual risk incurred by using estimate a instead of true estimate (0). In our case the loss is log loss, so the risk is

In other words it is the expected value of the negative logarithm of the probability assigned to observations when observations come from Normal centered at 0, but we think they come from Normal centered on a


This is the risk incurred by using estimate a, estimated on a finite sample drawn from the true density. In our case it is the following

Because the sample is a random variable, empirical risk is also a random variable. We can use the method of transformations to find it's distribution. Here are distributions for empirical risk estimated from training sets of size one for different values of a.

You can see that the modes of the distributions correspond to the true risk (a^2), but the means don't.


Supremum is defined as the least upper bound, and behaves lot like maximum. It is more general than the maximum however. Consider the sequence 0.9, 0.99, 0.999, ....
This sequence does not have a maximum but has a supremum.


This is the simplified middle part of the equation. It represents the probability of our empirical risk underestimating true risk by more than epsilon, for a given value of a. This would be used in place of the supremum if we cared about pointwise as opposed to uniform, convergence. Intuitively, we expect this probability to get smaller for any fixed epsilon as the sample size l gets larger. Here are a few densities of risk distribution for a=4 and different values of l


This is the random variable that represents the largest deviation of the empirical risk from corresponding true risk, over all models a. It's like an order statistic, except the samples are not IID, and there are infinitely many of them.

  • How can one find the distribution of \sup(R(a)-R_emp(a) in the setting above?
I came across an interesting article by Zhu and Lu talking about how non-informative priors can often be counter-intuitive. For instance suppose we are estimating parameter of a Bernoulli distribution. If we don't know anything, the most intuitive prior to use is uniform. However Jeffreys' "uninformative" prior doesn't look uniform at all:

Thursday, November 03, 2005

NIPS pre-prints

A google search reveals the following preprints associated with NIPS 2005:

  • Lafferty, Blei, Correlated Topic Models
  • Lafferty, Wasserman, Rodeo - Sparse Nonparametric Regression in High Dimensions
  • Tong Zhang and Rie K. Ando. Analysis of Spectral Kernel Design based Semi-supervised Learning.
  • Philipp Häfliger, et al, AER Building Blocks for Multi-Layer Multi-Chip Neuromorphic Vision Systems
  • Paninski, L. - Inferring prior probabilities from Bayes-optimal behavior
  • Ahrens, M., Huys, Q. & Paninski, L.. Large-scale biophysical parameter estimation in single neurons via constrained linear regression
  • Jack M. Wang, David J. Fleet, Aaron Hertzmann. Gaussian Process Dynamical Models.
    J.-P. Vert, R. Thurman and W. S. Noble, "Kernels for gene regulatory regions",
  • Firas Hamze and Nando de Freitas. Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
  • Jason E. Holt, On the Job Training
  • Marco Cuturi, Kenji Fukumizu - Multiresolution Kernels
  • Fan Li, Yiming Yang, Eric P. Xing. From lasso regression to feature vector machine.
  • Jian Zhang, Zoubin Ghahramani and Yiming Yang. Learning Multiple Related Tasks using Latent Independent Component Analysis.
  • James Diebel, An Application of Markov Random Fields to Range Sensing
  • A Computational Model of Eye Movements during Object Class Detection, Wei Zhang, Dimitris Samaras, Hyejin Yang, Greg Zelinsky
  • Cue Integration in Figure/Ground Labeling Xiaofeng Ren, Charless Fowlkes
  • Nelson, JD; Cottrell, GW; Filimon, F; Sejnowski, T (2005, Dec). Optimal experimental design models of naive human information acquisition. NIPS 2005
  • W. Zhang, D. Samaras, H. Yang and G. Zelinsky. A Computational Model of Eye Movements during Object Class Detection
  • Peter Gehler and max Welling Products of "Edge-Perts"
  • Cesa-Bianchi, Improved risk tail bounds for on-line algorithms
  • Semi-supervised Learning with Penalized Probabilistic Clustering. NIPS 2005 Zhengdong Lu, Todd Leen
  • LeCun, et al - Off-Road Obstacle Avoidance through End-to-End Learning
  • Sparse Covariance Selection via Robust Maximum Likelihood Estimation Authors: Onureena Banerjee, Alexandre d'Aspremont, Laurent El Ghaoui
  • Glenn Fung, Romer Rosales, Balaji Krishnapuram - Learning Rankings via Convex Hull Separations
  • Maximum Margin Semi-Supervised Learning for Structured Variables - Y. Altun, D. McAllester, M. Belkin
  • A Domain Decomposition Method for Fast Manifold Learning. Zhenyue Zhang and Hongyuan Zha
  • C. Scott and R. Nowak, ``Learning Minimum Volume Sets,"
  • huys qjm / zemel r / natarajan r / dayan pm - fast population coding
  • Two view learning: SVM-2K, Theory and Practice" by Jason D. R. Farquhar, David R. Hardoon, Hongying Meng, John Shawe-Taylor and Sandor Szedmak
  • N. Loeff and A.Sorokin and H. Arora and D.A. Forsyth, ``Efficient Unsupervised Learning for Localization and Detection in Object Categories'', NIPS 2005
  • T. Roos, , P. Grünwald, P. Myllymäki and H.Tirri. Generalization to Unseen Case
  • Mooy, J., & Kappen, H.J. Validity estimates for loopy belief propagation on binary real-world networks
  • Jorge G. Silva, Jorge S. Marques, João M. Lemos, Selecting Landmarkpoints for Sparse manifold Learning
  • Fleuret, F. and Blanchard, G. - Pattern Recognition from One Example by Chopping
  • A Probabilistic Approach for Optimizing Spectral Clustering - R. Jin ,C.Ding and F.Kang
  • Keiji Miura, Masato Okada, Shun-ichi Amari - Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
  • Blatt D. and Hero A. O., "From weighted classification to policy search"
  • A Connectionist Model for Constructive Modal Reasoning. A.d'Avila Garcez. Luis C. Lamb and Dov Gabbay
  • J. Ting, A. D'Souza, K. Yamamoto, T. Yoshioka, D. Hoffman, S. Kakei, L. Sergio, J. Kalaska, M. Kawato, P. Strick and S. Schaal. Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares.
  • G. L. Li, T.-Y. Leong, Learning Causal Bayesian Network with Constraints from Domain Knowledge
  • G. L. Li, T.-Y. Leong, Learning Bayesian Networks with Variable Grouping Methods.
  • L. Itti and P. Baldi. "Bayesian Surprise Attracts Human Attention"
  • Q-Clustering M. Narasimhan, N. Jojic and J. Bilmes
  • Nando de Freitas, Yang Wang, Maryam Mahdaviani, Dustin Lang. Fast Krylov Methods for N-Body Learning.
  • Consistency of One-Class SVM and Related Algorithms, R. Vert and J.-P. Vert
  • A. J. Bell and L. C. Parra, "Maximising sensitivity in a spiking network,"
  • R. S. Zemel, Q. J. M. Huys, R. Natarajan, and P. Dayan, "Probabilistic computation
  • in spiking populations," in Advances in NIPS, 2005
  • C. Yang, R. Duraiswami and L. Davis: "Efficient Kernel Machines Using the Improved Fast Gauss Transform", NIPS 2005
  • F. Orabona. Object-based Model of the Visual Attention for Imitation
  • Jieping Ye, Ravi Janardan, Qi Li. Two-Dimensional Linear Discriminant Analysis
  • L. Song, E. Gordon, and E. Gysels, "Phase Synchrony Rates for the Recognition of Motor Imageries in BCIs"
  • Kakade, S., M. Seeger and D. Foster: Worst-Case Bounds for Gaussian Process Models.
  • Shen, Y., A. Ng and M. Seeger: Fast Gaussian Process Regression using KD-Trees.
  • Kevin J. Lang - Fixing two weakness of the Spectral Method
  • Kari Torkkola and Eugene Tuv. Ecumenical kernels from random forests
  • C. Molter and U. Salihoglu and H. Bersini, Storing information through complex dynamics in Recurrent Neural Networks
  • J. Ting, A. D'Souza, K. Yamamoto, T. Yoshioka, D. Hoffman, S. Kakei, L. Sergio, J. Kalaska, M. Kawato, P. Strick and S. Schaal. Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares
  • Z. Nenadic, D.S. Rizzuto, R.A. Andersen, and J.W. Burdick, "Discriminat basedfeature selection with information theoretic objective
  • Mooy, J., & Kappen, H.J. (2004). Validity estimates for loopy belief propagation on binary real-world networks
  • Lackey, J. and Colagrosso, M (2005). A Kernel Method for Polychotomous Classification
  • J. A. Palmer, K. Kreutz-Delgado, D. P. Wipf, and B. D. Rao, Variational EM Algorithms for Non-Gaussian Latent Variable Models
  • Singh, S., Barto, A., and Chentanez, N. (2005). Intrinsically motivated reinforcement learning
  • Yun-Gang Zhang, Changshui Zhang, Separation of Music Signals by Harmonic Structure Modeling.
  • Rob Powers, Yoav Shoham, New Criteria and a New Algorithm for Learning in Multi-Agent Systems
  • Wood, F., Roth, S., and Black, M. J., "Modeling neural population spiking activity with Gibbs distributions,
  • Hinton, G. E. and Nair, V. Inferring motor programs from images of handwritten digits.
  • Onureena Banerjee, Alexandre d'Aspremont, Laurent El Ghaoui - Sparse Covariance Selection via Robust Maximum Likelihood Estimation.
  • "Measuring Shared Information and Coordinated Activity in Neuronal Networks." K. Klinkner, C. Shalizi, and M. Camperi, NIPS, 2005.
  • Generalisation Error Bounds for Classifiers Trained with Interdependent Data Usunier N., Amini M.-R., Gallinari P
  • Brafman, R. I. and Shani, G. "Resolving perceptual asliasing with noisy sensors"
  • Griffiths, T.L., and Ghahramani, Z. (to appear) Infinite Latent Feature Models and the Indian Buffet Process
  • Zhang, J., Ghahramani, Z. and Yang, Y. (to appear) Learning Multiple Related Tasks using Latent Independent Component Analysis.
  • Murray, I., MacKay, D.J.C., Ghahramani, Z. and Skilling, J. (to appear) Nested Sampling for Potts Models.
  • Snelson, E. and Ghahramani, Z. (to appear) Sparse Parametric Gaussian Processes
  • Ghahramani, Z. and Heller, K.A. (to appear) Bayesian Sets
  • On the Convergence of Eigenspaces in Kernel Principal Component Analysis: G. Blanchard and L. Zwald
  • Poupart, P. and Boutilier, C. VDCBPI: an Approximate Scalable Algorithm for Large Scale POMDPs
  • T. Murayama and P. Davis, Rate distortion codes for sensor networks: A system-level analysis,
  • Damian Eads, Karen Glocer, Simon Perkins, and James Theiler (2005). Grammar-guided feature extraction for time series classification.
  • Doi, E., Balcan, D. C., & Lewicki, M. S. A theoretical analysis of robust coding over noisy overcomplete channels
  • Nadler, Boaz; Lafon, Stephane; Coifman, Ronald R.; Kevrekidis, Ioannis G. - Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck operators
  • "Abstractions and Hierarchies for Learning and Planning" Lihong Li
  • "Walk-sum interpretation and analysis of Gaussian belief propagation ," Jason K. Johnson, Dmitry M. Malioutov, and Alan S. Willsky
  • Baker C., Tenenbaum J., & Saxe R. - Bayesian models of perceiving.intentional action
  • Tensor Subspace Analysis - Xiaofei He, Deng Cai, and Partha Niyogi
  • Laplacian Score for Feature Selection - Xiaofei He, Deng Cai, and Partha Niyogi.
  • Shunji Satoh ``Long-Range Horizontal Connections in V1 Serve to Multi-Scale Image Reconstruction
  • "Faster Rates in Regression via Active Learning" Rebecca Willett with R. Castro and R. Nowak,
  • Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification - Ashish Kapoor, Yuan (Alan) Qi, Hyungil Ahn and Rosalind W. Picard
  • Maxim Raginsky, Svetlana Lazebnik - Estimation of intrinsic dimensionality using high-rate vector quantization
  • Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer - The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
  • Patrick Flaherty, Michael Jordan, Adam Arkin - Robust Design of Biological Experiments
  • Yael Niv, Nathaniel Daw, Peter Dayan - How Fast to Work: Response Vigor, Motivation and Tonic Dopamine
  • Learning from Data of Variable Quality - Koby Crammer, Michael Kearns, and Jennifer Wortman
  • R. Kondor and T. Jebara - Gaussian and Wishart Hyperkernels
  • L. Liao, D. Fox, and H. Kautz. - Location-Based Activity Recognition Reinforcement Learning of Local Shape in the Game of Atari-Go David Silver, Richard Sutton, Martin Müller, Markus Enzenberger

Sunday, October 16, 2005

Commercial article databases and indexes

Full texts of recent machine learning papers are usually freely available on the web (ie, google). That's often not the case for related fields like statistics/math/econometrics, so one has to rely on some indexing service that links into commercial article databases. There are hundreds of such indexing services. I tend to use the following indexes, which I think have the highest recall for ML/statistics stuff: MathSciNet, JSTOR and Google Scholar.

Here are the numbers of results for 3 queries, searching for exact phrase in title:

"exponential families":
Google Scholar -- 635
mathscinet -- 660
JSTOR -- 147

"support vector":
Google Scholar -- 3800
mathscinet -- 131
JSTOR -- 1

"cross validation":
Google Scholar -- 1530
MathSciNet -- 181
JSTOR -- 53

JSTOR has full texts for all articles it indexes. In my experience, MathSciNet had full text links to about half the articles. Google Scholar is probably lowest precision search because it includes duplicates and documents not published anywhere besides the Web.

What are other good indexing services?

Saturday, October 15, 2005

Learning Mathematica

Here are some useful Mathematica training notebooks I came across.

Mathematica Training -- Notebooks from a 2 day Mathematica course
NKS summer school -- some notebooks from the "New Kind of Science" summer school intro to Mathematica
Programming Paradigms via Mathematica - notebooks from a course developed by Neidinger and Swallow

Thursday, March 10, 2005

The joy of scanning

I found a book scanner in our library's basement and decided to put it to good use by scanning some hard-to-find-online references on foundations of Bayesianism.
  • "Algebra of Probable Inference" by Cox, 1961 (aka, Why everyone should be a Bayesian). Demonstrates a functional derivation of probability theory as the unique extension of Boolean Algebra.
  • "Why I'm not a Bayesian" by Clark Glymour, Theory and Evidence, 1981. Criticizes Bayesian approach from the philosophy of science point of view.
  • "Why Glymour is a Bayesian" by R Rosenktrantz, Testing Scientific Theories, 1983
  • "Why isn't everyone a Bayesian?" by Efron B, American Statistician 1986. Examines reasons why not everybody was a Bayesian, as of 1986, with scorching reply from Lindley.
  • "Axioms of Maximum Entropy" by Skilling, MaxEnt 1988 proceedings. Sets up four practically motivated axioms, and uses them to derive maximum entropy as the unique method for picking a single probability distribution from the set of valid probability distributions.
I included the last one because there are some parallels with philosophy of Bayesianism. Following Cox, one should be a Bayesian if
  1. They assume boolean logic
  2. Can encode their true belief as a pdf
On other hand, following Skilling, one should use ME principle if they
  1. Believe Skilling's axioms.
  2. Have statements about true distribution in the form of constraints.
In both models assumption number 1 sounds more or less reasonable, whereas assumption number 2 causes discontent. With Bayesian approach, we don't know how to represent our prior knowledge as a pdf, whereas with ME approach, we don't know where to get the constraints from. However, in practice, we can often find a pdf that is close to representing the true belief of the expert, and similarly we can often find constraints that approximately rule out unfeasible distributions.

  • The references are over 20 years old. One newer one from 2000 by Berger looks at rising popularity of "objective" Bayesian and robust Bayesian approaches, and predicts practical Bayesianism of the future to contain both frequentist and traditional Bayesian elements. Does anyone know of more up-to-date overviews of different inductive reasoning methods?

BTW, if you are the author, and don't like the links to your work, let me know, and I'll remove them

Saturday, March 05, 2005

More on CiteULike

I've noticed some people I know starting to use CiteULike recently. Services like CiteULike are generally a good development because they increase efficiency of research: usually people share bibliography through bibliography section of their published articles, but a publication can take years to become accessible.

The immediate advantage of CiteULike is that it can fill Bib details in for you. Here are some CiteULike usage tips

  1. Make sure you have "Post to Citeulike" button on toolbar
  2. When adding article X, first search for X on
  3. If search results have a link to ACM, IEEE, JSTOR, Ingenta or some other supported collection you are in luck -- go there, click "Post to Citeulike" and it'll automatically extract the bibliographic information for you
  4. If you are adding books, search for book on Amazon, and do the same procedure
  5. For off-campus access, save electronic book/article copies to a web accessible directory, and add link to (it's only seen by you) to the book/article entry
  6. Add RSS of new submissions with relevant tags from CiteULike to your RSS aggregator
  7. Choose usernames such that people can find you through Google (ie, I use yaroslavvb, which google knows about)
  • Any other efficiency tips?

Friday, March 04, 2005

Machine Learning journals

Here are some journals to keep an eye on, along with their RSS feeds. I picked out the list by seeing where my favourite Machine Learning/stats papers came from:
  • Journal of Machine Learning Research (web, rss) The journal for machine learning publications. A better and freer replacement to "Machine Learning" journal -- here's some history
  • IEEE Transactions on Pattern Analysis and Machine Intelligence (web, rss) -- pattern means "visual pattern"
  • Machine Learning journal (web, rss) -- seems more applied than JMLR
  • Studies in History and Philosophy of Science Part B: Modern Physics (web, rss) -- sometimes discussions of Bayesianism, Maxent
  • Neural Computation(web, rss) -- neural nets, neurobiology, and general machine learning
  • ARXIV submissions on Learning (rss) and on Information Theory (rss)
  • Journal of the Royal Statistical Society (web, rss) -- some of the best papers on estimation were published here
  • Physics Review Letters (web) -- sometimes Machine Learning related stuff
  • The Annals of Statistics (web) -- Czisar published here
  • SIAM applied math (web)
  • Journal of Econometrics (web, rss) -- often the same problems but with a different name
  • Proceedings of the Nat'l Academy of Sciences(web) -- sometimes have interesting papers in applied math/computer science/statistics sections
CWI journal club has a bigger list of journals and links to their websites, mostly AI (not ML) oriented --

Watch out for information overload :O

  • What are some other journals that Machine Learning researchers should know about?

Monday, February 28, 2005

Blogs of Note

From Geomblog -- "Machine learning blogs appear to be popping up faster than you can say 'expectation-maximization'". With that note, here are a few machine learning (and related) blogs I came across recently



Saturday, February 26, 2005

Knowledge Intensive Learning

When prior knowledge is given in the form of a prior, Bayesian method is the rational way of making decisions, based on Cox' arguments. However, prior knowledge doesn't usually come in a form of a prior. For instance an expert may say that X and Y are independent, or that increasing value of X will influence Y to increase. We need to figure out how to use those kinds of knowledge to bias our learners, and this is where Knowledge Intensive Learning comes in. Eric Altendorf does research in the area and is doing a series of posts on it in his new blog.

Friday, February 25, 2005

Naive Bayes to fight HIV?

Quoting -- "Microsoft scientists David Heckerman and Nebojsa Jojic were the first to pioneer the medical uses of anti-spam software. They have joined up with doctors and scientists from the University of Washington and Australia's Royal Perth Hospital to build more potent vaccines based on the data gathered by Microsoft's technologies."


Thursday, February 24, 2005

How to browse the web while pretending to do Machine Learning research

  • What are some other ways of finding interesting stuff?

Tuesday, February 22, 2005


I came across CiteULike on Seb's Open Research blog a week ago, tried it out, and it seems very useful. Basically it's a cross between bibliography management software like JafRef, and social bookmarking like You can use it to keep track of the papers you find, and export in the BibTex format. Then you can see what other people added given paper, and see what papers they are reading. You can also add comments to papers and view other people's comments (not many at this point). Suresh has a more detailed review.


Thursday, February 17, 2005

Machine Learning reading groups

If you are looking for ideas for things to read, you should check out what other Machine Learning Reading Groups are reading. I compiled the following list by searching google/asking people.

This list is not very up to date. This is a permalink so feel free to link to it. Please let me know through email, or comment section about other up-to-date MLRGs you know (site must list papers)

Current MLRGs
(as of Feb 19th 2005)
  • Oregon State U (Roweis, Tishby, Jordan, Collins, Caticha)-- mlrg
  • Gatsby (McKay, ARD, LDA) -- mlrg
  • Humboldt-Universitat zu Berlin (ICML, KDD papers)-- mlrg
  • TTIC Chicago list of readings (Langford, Kakade, McAllester)seminars
  • University of New Mexico (Jaakkola, Mooney, Scholkopf) -- mlrg
  • Oregon Health and Science U. (Seeger, Geman, MacKay) -- mlrg
  • Washington University (Weiss, Srebro) -- mlrg
  • MIT (Neal, Hofmann, Tishby) -- mlrg
  • University of California, Irvine (Blei, Jordan, Welling) -- mlrg
  • University of Amsterdam (Manning, Koller, Collins, Valiant) mlrg
  • Hebrew University (IB, SVM, Rademacher, Chechnik) mlrg
  • Cornell (Jordan, Schuurmans,Cristianini) -- mlrg
  • South Carolina (Kakade, Dresner, Feigenbaum) mlrg
  • U. Alberta (Sutton, RL, MaxEnt)-- mlrg
  • Weizemann Institute, (vision oriented, Jordan, Buntine) -- mlrg
  • Helsinki U of Tech (mostly neuroscience, Sejnowski) -- mlrg
  • University of London (vision oriented) -- mlrg
  • Rutgers (Michael Jordan's notes) -- mlrg
  • UMass (vision oriented, ICA, kernel PCA) -- mlrg
  • Edinburgh (NLP group, Hoffman, Manning,Collins) -- mlrg
  • U. of California, San Diego (Tishby, Lafferty, Joachims, Herbrich) -- mlrg
  • UTexas (Aggarwal, Altun, Manning) -- mlrg
Seminars,Journal Clubs,slightly outdated MLRGs
  • Columbia (November) -- mlrg
  • U.Toronto (July)-- mlrg
  • Tufts University (December) -- mlrg
  • Australian National U. (December)-- mlrg
  • Centre Universitaire d'Informatique (June) --journal club
  • Microsoft Research (Herbrich, large-margin, factor graphs, kernels) -- mlrg
  • CWI Journal Club (Netherlands) journal club
  • Seminars at NYU (Langford, Saul) seminars
  • Graphical Models Reading Group at CMU (about 10 months old) mlrg

Who is being read?
Below is the list of most popular authors ranked by how many groups in the "current MLRGs" section had them listed:

9 - Koller
8 - Jordan
7 - Weiss
6 - Taskar, Saul, McCallum, Lafferty, Ghahramani, Friedman, Collins
5 - Roweis, Pereira, Ng, Domingos
4 - Zhu, Tishby, Russell, Roth, Li, Kearns, Jaakkola, Guestrin, Blei
3 - Zhou, Yair, Wong, Wellner, Tenenbaum, Slonim, Singh, Singer, Scholkopf, Schapire, Mooney, Mansour, Manning, MacKay, Joachims, Hofmann, Griffiths, Freund, Freeman, Dengyong, Breiman, Bartlett, Altun

  • What other reading groups have websites?
  • What are other ways of finding interesting papers?

Saturday, February 12, 2005

Tensor Methods for Optimization

Second-order methods use an exact Hessian, or a compact approximation to direct the search for the minimum. They are called second-order because the Hessian they use, ie, second derivative, is the second order term in the Taylor expansion of the function. What if our function is not a quadratic, and we wanted to guide our minimization using some 3rd or 4th order information? That's where tensor methods come in.


Main idea to Schnabel's tensor method is to select a set of p points from previous evaluations such that they are mostly linearly independent (each new direction makes at least a 45 degree angle with old direction). The idea is then to try to interpolate function values at those points by fitting a 4th order model with special form of 3rd and 4th order terms. Because of the restricted form of these terms, the overhead associated with this step is not significant over the standard Newton-Gauss algorithm. Because it approximates function with 4th order expansion, the number of problems for which this method converges was larger. Schnabel also proves superlinear convergence of this method for some cases when Newton's method converges linearly.


Seth Greenblatt tried tensor method on some econometric models, and results were 1.5-3 times reduction in number of function evaluations. He compared it against Newton-Rhapson method, using exact matrix inversion (O(n^3)) and the extra overhead of for the tensor method wasn't significant. The problems he tested on were econometric models including Lucas-Rapping model and Berndt-Wood KLEM model. Lucas-Rapping model in his thesis was basically logistic regression with some econometric semantics.

Schnabel's paper in SIAM J Opt also used explicit Hessian, and O(n^3) steps to invert it. His 1998 paper introduces implementation of tensor method, TenSolve, which uses same approach as Gauss-Newton to approximate second order term from gradients, which means O(n^2) space and time requirement. Asymptotic overhead of using tensors was O(n^1.5)


TenSolve stores the approximation to the Hessian explicitly, so that takes O(n^2) space. It seems conceptually possible to reduce storage to O(n) if we use a method analogous to limited memory BFGS to store the Hessian. If that worked, it would result in a tensor optimization method with O(n) storage complexity and O(n^1.5) computation complexity

  • What are some applications of Machine Learning where an O(n^1.5) method would be acceptable?
  • What would be a good dataset on which to test CRF training using this method?


  • Schnabel, Robert B and Ta-Tung Chow (1991) "Tensor Methods for Unconstrained Optimization Using Second Derivatives," SAIM J. Optimization 1,3,293-315
  • Ali Bouaricha, Robert B. Schnabel: Algorithm 768: TENSOLVE: A Software Package for Solving Systems of Nonlinear Equations and Nonlinear Least-Squares Problems Using Tensor Methods. ACM Trans. Math. Softw. 23 (2): 174-195 (1997)
  • Greenblatt, "Tensor Methods for Full-Information Maximum Likelihood Estimation", PhD thesis, George Washington University
  • TenSolve (Implementation of Schnabel's tensor optimization method):

Friday, February 11, 2005

Uninformative Priors

I came across an interesting article by Zhu and Lu talking about how non-informative priors can often be counter-intuitive. For instance suppose we are estimating parameter of a Bernoulli distribution. If we don't know anything, the most intuitive prior to use is uniform. However Jeffreys' "uninformative" prior doesn't look uniform at all:

This prior seems to favour values of theta close to 0 or 1, so how can it be uninformative?

The intuition they provide is the following: think of Maximum Likelihood estimator as the most uninformative estimate of parameters. Then we can rank informativeness of priors by how closely the Bayesian Posterior Mean estimate comes to the maximum likelihood estimate.

If we observed k heads out of n tosses, we get following estimates:

1. MLE: k/n
2. PM using Jeffrey's prior (k+1/2)/(n+1)
3. Laplace smoothing (Uniform) (k+1)/(n+2)
4. PM using Beta(a,b) prior (k+a)/(n+a+b)

You can see that using this informal criteria for informativeness, Jeffreys' prior is more uninformative than Uniform, and Beta(0,0) (the limiting distribution) is the least informative of all, because then Posterior Mean estimate will coincide with the MLE.


Wednesday, February 09, 2005

Why I started this blog

I've been doing research in Machine Learning (Conditional Random Fields and related) for the last year. This is a place for me to organize my thoughts and write down interesting new things I come across (but feel free to add comments). Some of the things I may be adding:

  1. Comments on papers
  2. Mini-tutorials on ML-related topics
  3. Links to online Machine Learning resources

It won't be linear in nature as I'll revise old posts when relevant information comes up. Maybe in the future I'll switch to a setup similar to Cosma Shalizi's notebooks, but at this point, blog is the most practical thing to do.


Monday, February 07, 2005

Lagrange multipliers: equality constraints

Constrained optimization often comes up in machine learning.

  1. Finding lengths of best instantaneous code when distribution of source is known
  2. Finding maximum entropy distribution subject to given constraints
  3. Minimizing distortion specified by the amount of information X provides about Y (Tishby's, Information Bottleneck method)
All of the examples above can be solved using the method of Lagrange multipliers. To understand motivation for Lagrange multipliers with equality constraints, you should check out Dan Klein's excellent tutorial.

However, you don't have to understand them to be able to go through the math. Here's how you can do it in 5 easy steps:

Write down the thing you are optimizing as J(x)
  1. Write down your constraints as g_i(x)=0
  2. Define L(x,l1,l2,...)=J(x) + l1 g1(x) + l2 g2(x) +...
  3. Differentiate L with respect to each variable and set to 0
  4. Solve this system of equations

Small Example

Suppose you want to maximize

subject to

Your problem looks like this:

(black line is where you solution must lie)

Then our only constraint function is g(x,y)=x-y+1=0. Our L (Langrangian) then becomes

differentiating and setting to 0 we get

Solve this system of 3 equations by substitution and you get x=-0.5, y=0.5, lambda=-1. The lagrange multiplier lambda_i represents ratio of the gradients of objective function J and i'th constraint function g_i at the solution point (that makes sense because they point in the same direction)

Bigger Example

For a more realistic example, consider maximizing entropy over a set of discrete distributions over positive integers where the first moment is fixed. Since we have an infinite number of possible X's, differentiating the Lagrangian gives us an countably infinite system of equations. But that's not actually a problem (unless you try to write them all down). Here's the detailed derivation

So you can see that highest entropy discrete distribution with fixed mean is geometric distribution. We can solve the same problem for continuous X's in the positive domain, and get exponential distribution. If we allow X's to span whole reals, there will be no solution for the same reason as why "uniform distribution over all reals" doesn't exist.


  1. What to use for complicated 3d graphs? I tried Mathematica, but not sure how to combine two 3d graphs into one
  2. Can anyone do Exercise 1. in Chapter 5 of Christianini's SVM book?
  3. If we use Robinson's non-standard Analysis, could we sensibly define "uniform distribution over all integers"?


Saturday, February 05, 2005

Interesting blogs, newsgroups, chatrooms

Here are some things that someone who likes math, Machine Learning or statistics may find interesting




Mailing Lists:


  • #MachineLearning

Finding Blogs:

After looking at "Referring Link" stats of people browsing this blog, I discovered two new ways of finding relevant blogs

I found a few people who came to this blog after searching for mentions of , and their IP's traced to Illinois, I wonder why ;)

Feel free to post other cool blogs, or other methods of finding them

Thursday, February 03, 2005

Getting probabilities from students

A comment on Computational Complexity blog blog inspired me to look at the following problem

Suppose you give true/false test to students and want them to put down their subjective probabilities that answer is true/false. To make them honest, you would want to award points in such a way, that in order to maximize their expected points, the student would need to report their correct probabilities. Which utility function to use? It turns out, f(x)=A log x + B is one such family. Here's the derivation

1) Does any other such utility function exist?
2) What are other fun ways to torture students?

Here are graphs of three possible functions that can be used for eliciting probabilities from students. They are all concave, maybe one can prove any suitable function must be concave?