## Friday, December 16, 2005

### Trends in Machine Learning according to Google Scholar

You can see there's a blip at 1983 when the workshop was held.

Yann LeCun quipped at NIPS closing banquet that people who joined the field in the last 5 years probably never heard of the word "Neural Network". Similar search (normalized by results for "machine learning") reveals a recent downward trend.

You can see a major upward trend starting around 1985 (that's when Yann LeCun and several others independently rediscovered backpropagation algorithm), peaking in 1992, and going downwards from then.

An even greater downward trend is seen when searching for "Expert System",

"Genetic algorithms" seem to have taken off in the 90's, and leveled off somewhat in recent years

On other hand, search for "support vector machine" shows no sign of slowing down

(1995 is when Vapnik and Cortez proposed the algorithm)

Also, "Naive Bayes" seems to be growing without bound

If I were to trust this, I would say that Naive Bayes research the hottest machine learning area right now

"HMM"'s seem to have been losing in share since 1981

(or perhaps people are becoming less likely to write things like "hmm, this result was unexpected"?)

What was the catastrophic even of 1981 that forced such a rapid extinction of HMM's (or hmm's) in scientific literature?

Finally a worrying trend is seen in the search for "artificial stupidity" divided by corresponding hits for "artificial intelligence". The 2000 through 2004 graph shows a definite updward direction.

## Tuesday, December 13, 2005

### pigeon-level AI

But maybe instead we should start by achieving pigeon-level AI. Someone has already measured pigeon performance at discriminating between painting styles

## Monday, December 12, 2005

### Archaeology

"1983 INTERNATIONAL MACHINE LEARNING WORKSHOP: AN INFORMAL REPORT"

link

Some snippets

Here's Tom Dietterich again: ``I was surprised that you summarized the workshop

in terms of an "incremental" theme. I don't think incremental-ness

is particularly important--especially for expert system work.

"So the language analysis problem has been solved?!?" by Fernando Pereira, Aug 20 1983

link

" What this replacement does to modularity, testability and

reproducibility of results is sadly clear in the large amount of

published "research" in natural language analysis which is untestable

and irreproducible."

## Saturday, November 12, 2005

### The Key Theorem of the Learning Theory

The important equation is the following:

Here I'll explain the parts of the formula using the following motivating example:

Suppose our true distribution is a normal distribution with unit variance, centered at 0, and we estimate its mean from data using method of maximum likelihood. This falls into domain of Empirical Risk Minimization because it's equivalent to finding unit variance normal distribution which minimizes empirical log-loss. If this method is consistent, then our maximum likelihood estimate of the mean will be correct in the limit

In order for the theorem to apply, we need to do make sure that we can put a bound on the loss incurred by functions in our hypothesis space, so here, make an additional restriction that our fitted normals must have mean between -100 and 100.

Here's what parts of the equation stand for

1.

l is the size of our training set

2.

a is the parameter that indexes our probability distributions, in this case, it's mean of our fitted Normal. Lambda is it's domain, in this case,

3.

This is the actual risk incurred by using estimate a instead of true estimate (0). In our case the loss is log loss, so the risk is

In other words it is the expected value of the negative logarithm of the probability assigned to observations when observations come from Normal centered at 0, but we think they come from Normal centered on a

4.

This is the risk incurred by using estimate a, estimated on a finite sample drawn from the true density. In our case it is the following

Because the sample is a random variable, empirical risk is also a random variable. We can use the method of transformations to find it's distribution. Here are distributions for empirical risk estimated from training sets of size one for different values of a.

You can see that the modes of the distributions correspond to the true risk (a^2), but the means don't.

5.

Supremum is defined as the least upper bound, and behaves lot like maximum. It is more general than the maximum however. Consider the sequence 0.9, 0.99, 0.999, ....

This sequence does not have a maximum but has a supremum.

6.

This is the simplified middle part of the equation. It represents the probability of our empirical risk underestimating true risk by more than epsilon, for a given value of a. This would be used in place of the supremum if we cared about pointwise as opposed to uniform, convergence. Intuitively, we expect this probability to get smaller for any fixed epsilon as the sample size l gets larger. Here are a few densities of risk distribution for a=4 and different values of l

7.

This is the random variable that represents the largest deviation of the empirical risk from corresponding true risk, over all models a. It's like an order statistic, except the samples are not IID, and there are infinitely many of them.

Questions:

- How can one find the distribution of \sup(R(a)-R_emp(a) in the setting above?

## Thursday, November 03, 2005

### NIPS pre-prints

- Lafferty, Blei, Correlated Topic Models
- Lafferty, Wasserman, Rodeo - Sparse Nonparametric Regression in High Dimensions
- Tong Zhang and Rie K. Ando. Analysis of Spectral Kernel Design based Semi-supervised Learning.
- Philipp Häfliger, et al, AER Building Blocks for Multi-Layer Multi-Chip Neuromorphic Vision Systems
- Paninski, L. - Inferring prior probabilities from Bayes-optimal behavior
- Ahrens, M., Huys, Q. & Paninski, L.. Large-scale biophysical parameter estimation in single neurons via constrained linear regression
- Jack M. Wang, David J. Fleet, Aaron Hertzmann. Gaussian Process Dynamical Models.

J.-P. Vert, R. Thurman and W. S. Noble, "Kernels for gene regulatory regions", - Firas Hamze and Nando de Freitas. Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
- Jason E. Holt, On the Job Training
- Marco Cuturi, Kenji Fukumizu - Multiresolution Kernels
- Fan Li, Yiming Yang, Eric P. Xing. From lasso regression to feature vector machine.
- Jian Zhang, Zoubin Ghahramani and Yiming Yang. Learning Multiple Related Tasks using Latent Independent Component Analysis.
- James Diebel, An Application of Markov Random Fields to Range Sensing
- A Computational Model of Eye Movements during Object Class Detection, Wei Zhang, Dimitris Samaras, Hyejin Yang, Greg Zelinsky
- Cue Integration in Figure/Ground Labeling Xiaofeng Ren, Charless Fowlkes
- Nelson, JD; Cottrell, GW; Filimon, F; Sejnowski, T (2005, Dec). Optimal experimental design models of naive human information acquisition. NIPS 2005
- W. Zhang, D. Samaras, H. Yang and G. Zelinsky. A Computational Model of Eye Movements during Object Class Detection
- Peter Gehler and max Welling Products of "Edge-Perts"
- Cesa-Bianchi, Improved risk tail bounds for on-line algorithms
- Semi-supervised Learning with Penalized Probabilistic Clustering. NIPS 2005 Zhengdong Lu, Todd Leen
- LeCun, et al - Off-Road Obstacle Avoidance through End-to-End Learning
- Sparse Covariance Selection via Robust Maximum Likelihood Estimation Authors: Onureena Banerjee, Alexandre d'Aspremont, Laurent El Ghaoui
- Glenn Fung, Romer Rosales, Balaji Krishnapuram - Learning Rankings via Convex Hull Separations
- Maximum Margin Semi-Supervised Learning for Structured Variables - Y. Altun, D. McAllester, M. Belkin
- A Domain Decomposition Method for Fast Manifold Learning. Zhenyue Zhang and Hongyuan Zha
- C. Scott and R. Nowak, ``Learning Minimum Volume Sets,"
- huys qjm / zemel r / natarajan r / dayan pm - fast population coding
- Two view learning: SVM-2K, Theory and Practice" by Jason D. R. Farquhar, David R. Hardoon, Hongying Meng, John Shawe-Taylor and Sandor Szedmak
- N. Loeff and A.Sorokin and H. Arora and D.A. Forsyth, ``Efficient Unsupervised Learning for Localization and Detection in Object Categories'', NIPS 2005
- T. Roos, , P. Grünwald, P. Myllymäki and H.Tirri. Generalization to Unseen Case
- Mooy, J., & Kappen, H.J. Validity estimates for loopy belief propagation on binary real-world networks
- Jorge G. Silva, Jorge S. Marques, João M. Lemos, Selecting Landmarkpoints for Sparse manifold Learning
- Fleuret, F. and Blanchard, G. - Pattern Recognition from One Example by Chopping
- A Probabilistic Approach for Optimizing Spectral Clustering - R. Jin ,C.Ding and F.Kang
- Keiji Miura, Masato Okada, Shun-ichi Amari - Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
- Blatt D. and Hero A. O., "From weighted classification to policy search"
- A Connectionist Model for Constructive Modal Reasoning. A.d'Avila Garcez. Luis C. Lamb and Dov Gabbay
- J. Ting, A. D'Souza, K. Yamamoto, T. Yoshioka, D. Hoffman, S. Kakei, L. Sergio, J. Kalaska, M. Kawato, P. Strick and S. Schaal. Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares.
- G. L. Li, T.-Y. Leong, Learning Causal Bayesian Network with Constraints from Domain Knowledge
- G. L. Li, T.-Y. Leong, Learning Bayesian Networks with Variable Grouping Methods.
- L. Itti and P. Baldi. "Bayesian Surprise Attracts Human Attention"
- Q-Clustering M. Narasimhan, N. Jojic and J. Bilmes
- Nando de Freitas, Yang Wang, Maryam Mahdaviani, Dustin Lang. Fast Krylov Methods for N-Body Learning.
- Consistency of One-Class SVM and Related Algorithms, R. Vert and J.-P. Vert
- A. J. Bell and L. C. Parra, "Maximising sensitivity in a spiking network,"
- R. S. Zemel, Q. J. M. Huys, R. Natarajan, and P. Dayan, "Probabilistic computation
- in spiking populations," in Advances in NIPS, 2005
- C. Yang, R. Duraiswami and L. Davis: "Efficient Kernel Machines Using the Improved Fast Gauss Transform", NIPS 2005
- F. Orabona. Object-based Model of the Visual Attention for Imitation
- Jieping Ye, Ravi Janardan, Qi Li. Two-Dimensional Linear Discriminant Analysis
- L. Song, E. Gordon, and E. Gysels, "Phase Synchrony Rates for the Recognition of Motor Imageries in BCIs"
- Kakade, S., M. Seeger and D. Foster: Worst-Case Bounds for Gaussian Process Models.
- Shen, Y., A. Ng and M. Seeger: Fast Gaussian Process Regression using KD-Trees.
- Kevin J. Lang - Fixing two weakness of the Spectral Method
- Kari Torkkola and Eugene Tuv. Ecumenical kernels from random forests
- C. Molter and U. Salihoglu and H. Bersini, Storing information through complex dynamics in Recurrent Neural Networks
- J. Ting, A. D'Souza, K. Yamamoto, T. Yoshioka, D. Hoffman, S. Kakei, L. Sergio, J. Kalaska, M. Kawato, P. Strick and S. Schaal. Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares
- Z. Nenadic, D.S. Rizzuto, R.A. Andersen, and J.W. Burdick, "Discriminat basedfeature selection with information theoretic objective
- Mooy, J., & Kappen, H.J. (2004). Validity estimates for loopy belief propagation on binary real-world networks
- Lackey, J. and Colagrosso, M (2005). A Kernel Method for Polychotomous Classification
- J. A. Palmer, K. Kreutz-Delgado, D. P. Wipf, and B. D. Rao, Variational EM Algorithms for Non-Gaussian Latent Variable Models
- Singh, S., Barto, A., and Chentanez, N. (2005). Intrinsically motivated reinforcement learning
- Yun-Gang Zhang, Changshui Zhang, Separation of Music Signals by Harmonic Structure Modeling.
- Rob Powers, Yoav Shoham, New Criteria and a New Algorithm for Learning in Multi-Agent Systems
- Wood, F., Roth, S., and Black, M. J., "Modeling neural population spiking activity with Gibbs distributions,
- Hinton, G. E. and Nair, V. Inferring motor programs from images of handwritten digits.
- Onureena Banerjee, Alexandre d'Aspremont, Laurent El Ghaoui - Sparse Covariance Selection via Robust Maximum Likelihood Estimation.
- "Measuring Shared Information and Coordinated Activity in Neuronal Networks." K. Klinkner, C. Shalizi, and M. Camperi, NIPS, 2005.
- Generalisation Error Bounds for Classifiers Trained with Interdependent Data Usunier N., Amini M.-R., Gallinari P
- Brafman, R. I. and Shani, G. "Resolving perceptual asliasing with noisy sensors"
- Griffiths, T.L., and Ghahramani, Z. (to appear) Infinite Latent Feature Models and the Indian Buffet Process
- Zhang, J., Ghahramani, Z. and Yang, Y. (to appear) Learning Multiple Related Tasks using Latent Independent Component Analysis.
- Murray, I., MacKay, D.J.C., Ghahramani, Z. and Skilling, J. (to appear) Nested Sampling for Potts Models.
- Snelson, E. and Ghahramani, Z. (to appear) Sparse Parametric Gaussian Processes
- Ghahramani, Z. and Heller, K.A. (to appear) Bayesian Sets
- On the Convergence of Eigenspaces in Kernel Principal Component Analysis: G. Blanchard and L. Zwald
- Poupart, P. and Boutilier, C. VDCBPI: an Approximate Scalable Algorithm for Large Scale POMDPs
- T. Murayama and P. Davis, Rate distortion codes for sensor networks: A system-level analysis,
- Damian Eads, Karen Glocer, Simon Perkins, and James Theiler (2005). Grammar-guided feature extraction for time series classification.
- Doi, E., Balcan, D. C., & Lewicki, M. S. A theoretical analysis of robust coding over noisy overcomplete channels
- Nadler, Boaz; Lafon, Stephane; Coifman, Ronald R.; Kevrekidis, Ioannis G. - Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck operators
- "Abstractions and Hierarchies for Learning and Planning" Lihong Li
- "Walk-sum interpretation and analysis of Gaussian belief propagation ," Jason K. Johnson, Dmitry M. Malioutov, and Alan S. Willsky
- Baker C., Tenenbaum J., & Saxe R. - Bayesian models of perceiving.intentional action
- Tensor Subspace Analysis - Xiaofei He, Deng Cai, and Partha Niyogi
- Laplacian Score for Feature Selection - Xiaofei He, Deng Cai, and Partha Niyogi.
- Shunji Satoh ``Long-Range Horizontal Connections in V1 Serve to Multi-Scale Image Reconstruction
- "Faster Rates in Regression via Active Learning" Rebecca Willett with R. Castro and R. Nowak,
- Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification - Ashish Kapoor, Yuan (Alan) Qi, Hyungil Ahn and Rosalind W. Picard
- Maxim Raginsky, Svetlana Lazebnik - Estimation of intrinsic dimensionality using high-rate vector quantization
- Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer - The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
- Patrick Flaherty, Michael Jordan, Adam Arkin - Robust Design of Biological Experiments
- Yael Niv, Nathaniel Daw, Peter Dayan - How Fast to Work: Response Vigor, Motivation and Tonic Dopamine
- Learning from Data of Variable Quality - Koby Crammer, Michael Kearns, and Jennifer Wortman
- R. Kondor and T. Jebara - Gaussian and Wishart Hyperkernels
- L. Liao, D. Fox, and H. Kautz. - Location-Based Activity Recognition Reinforcement Learning of Local Shape in the Game of Atari-Go David Silver, Richard Sutton, Martin Müller, Markus Enzenberger

## Sunday, October 16, 2005

### Commercial article databases and indexes

Here are the numbers of results for 3 queries, searching for exact phrase in title:

"exponential families":

Google Scholar -- 635

mathscinet -- 660

JSTOR -- 147

"support vector":

Google Scholar -- 3800

mathscinet -- 131

JSTOR -- 1

"cross validation":

Google Scholar -- 1530

MathSciNet -- 181

JSTOR -- 53

JSTOR has full texts for all articles it indexes. In my experience, MathSciNet had full text links to about half the articles. Google Scholar is probably lowest precision search because it includes duplicates and documents not published anywhere besides the Web.

Questions:

What are other good indexing services?

## Saturday, October 15, 2005

### Learning Mathematica

Mathematica Training -- Notebooks from a 2 day Mathematica course

NKS summer school -- some notebooks from the "New Kind of Science" summer school intro to Mathematica

Programming Paradigms via Mathematica - notebooks from a course developed by Neidinger and Swallow

## Thursday, March 10, 2005

### The joy of scanning

- "Algebra of Probable Inference" by Cox, 1961 (aka, Why everyone should be a Bayesian). Demonstrates a functional derivation of probability theory as the unique extension of Boolean Algebra.
- "Why I'm not a Bayesian" by Clark Glymour, Theory and Evidence, 1981. Criticizes Bayesian approach from the philosophy of science point of view.
- "Why Glymour is a Bayesian" by R Rosenktrantz, Testing Scientific Theories, 1983
- "Why isn't everyone a Bayesian?" by Efron B, American Statistician 1986. Examines reasons why not everybody was a Bayesian, as of 1986, with scorching reply from Lindley.
- "Axioms of Maximum Entropy" by Skilling, MaxEnt 1988 proceedings. Sets up four practically motivated axioms, and uses them to derive maximum entropy as the unique method for picking a single probability distribution from the set of valid probability distributions.

- They assume boolean logic
- Can encode their true belief as a pdf

- Believe Skilling's axioms.
- Have statements about true distribution in the form of constraints.

Questions

- The references are over 20 years old. One newer one from 2000 by Berger looks at rising popularity of "objective" Bayesian and robust Bayesian approaches, and predicts practical Bayesianism of the future to contain both frequentist and traditional Bayesian elements. Does anyone know of more up-to-date overviews of different inductive reasoning methods?

BTW, if you are the author, and don't like the links to your work, let me know, and I'll remove them

## Saturday, March 05, 2005

### More on CiteULike

The immediate advantage of CiteULike is that it can fill Bib details in for you. Here are some CiteULike usage tips

- Make sure you have "Post to Citeulike" button on toolbar
- When adding article X, first search for X on scholar.google.com
- If search results have a link to ACM, IEEE, JSTOR, Ingenta or some other supported collection you are in luck -- go there, click "Post to Citeulike" and it'll automatically extract the bibliographic information for you
- If you are adding books, search for book on Amazon, and do the same procedure
- For off-campus access, save electronic book/article copies to a web accessible directory, and add link to (it's only seen by you) to the book/article entry
- Add RSS of new submissions with relevant tags from CiteULike to your RSS aggregator
- Choose usernames such that people can find you through Google (ie, I use yaroslavvb, which google knows about)

- Any other efficiency tips?

## Friday, March 04, 2005

### Machine Learning journals

- Journal of Machine Learning Research (web, rss) The journal for machine learning publications. A better and freer replacement to "Machine Learning" journal -- here's some history
- IEEE Transactions on Pattern Analysis and Machine Intelligence (web, rss) -- pattern means "visual pattern"
- Machine Learning journal (web, rss) -- seems more applied than JMLR
- Studies in History and Philosophy of Science Part B: Modern Physics (web, rss) -- sometimes discussions of Bayesianism, Maxent
- Neural Computation(web, rss) -- neural nets, neurobiology, and general machine learning
- ARXIV submissions on Learning (rss) and on Information Theory (rss)
- Journal of the Royal Statistical Society (web, rss) -- some of the best papers on estimation were published here
- Physics Review Letters (web) -- sometimes Machine Learning related stuff
- The Annals of Statistics (web) -- Czisar published here
- SIAM applied math (web)
- Journal of Econometrics (web, rss) -- often the same problems but with a different name
- Proceedings of the Nat'l Academy of Sciences(web) -- sometimes have interesting papers in applied math/computer science/statistics sections

Watch out for information overload :O

Questions:

- What are some other journals that Machine Learning researchers should know about?

## Monday, February 28, 2005

### Blogs of Note

Technical

- Machine Learning -- Eric Altendorf is a grad student doing research on Knowledge Intensive learning here at OSU
- Machine Learning -- Josek is a PhD student working on Bayesian Learning
- Alpha and Beta -- (two stats grad students?)

Non-Technical

- BMI students -- a group of students in a graduate BioInformatics program
- Math4Poets -- math history
- Treadmill to Infinity -- AI researcher at MIT
- Sam Reid's blog -- grad student at U.Colorado, does Machine Learning research

Yin Zhangqi's Blog -- physics grad student. It's mostly in Chinese, but with google's translator, and good deal of imagination, you can read some things

## Saturday, February 26, 2005

### Knowledge Intensive Learning

## Friday, February 25, 2005

### Naive Bayes to fight HIV?

article

## Thursday, February 24, 2005

### How to browse the web while pretending to do Machine Learning research

- Browse ML-related tags on Del.icio.us (http://del.icio.us/tag/machinelearning)
- Find people who bookmarked JMLR and browse their bookmarks/inbox (ie http://del.icio.us/dnaboy76)
- Search for "Machine Learning" on technoratid (search)
- Find people who subscribed to John Langford's blog through bloglines, and browse their subscriptions, (ie http://bloglines.com/public/Suresh)
- Find people who entered ML papers in Citeulike, and browse their library (ie http://www.citeulike.org/user/brian)
- At some point, get back to work ;)

- What are some other ways of finding interesting stuff?

## Tuesday, February 22, 2005

### CiteULike

Resources:

- http://www.citeulike.org/
- CiteULike FAQ
- Papers I've added to it recently

## Thursday, February 17, 2005

### Machine Learning reading groups

This list is not very up to date. This is a permalink so feel free to link to it. Please let me know through email, or comment section about other up-to-date MLRGs you know (site must list papers)

Current MLRGs

(as of Feb 19th 2005)

- Oregon State U (Roweis, Tishby, Jordan, Collins, Caticha)-- mlrg
- Gatsby (McKay, ARD, LDA) -- mlrg
- Humboldt-Universitat zu Berlin (ICML, KDD papers)-- mlrg
- TTIC Chicago list of readings (Langford, Kakade, McAllester)seminars
- University of New Mexico (Jaakkola, Mooney, Scholkopf) -- mlrg
- Oregon Health and Science U. (Seeger, Geman, MacKay) -- mlrg
- Washington University (Weiss, Srebro) -- mlrg
- MIT (Neal, Hofmann, Tishby) -- mlrg
- University of California, Irvine (Blei, Jordan, Welling) -- mlrg
- University of Amsterdam (Manning, Koller, Collins, Valiant) mlrg
- Hebrew University (IB, SVM, Rademacher, Chechnik) mlrg
- Cornell (Jordan, Schuurmans,Cristianini) -- mlrg
- South Carolina (Kakade, Dresner, Feigenbaum) mlrg
- U. Alberta (Sutton, RL, MaxEnt)-- mlrg
- Weizemann Institute, (vision oriented, Jordan, Buntine) -- mlrg
- Helsinki U of Tech (mostly neuroscience, Sejnowski) -- mlrg
- University of London (vision oriented) -- mlrg
- Rutgers (Michael Jordan's notes) -- mlrg
- UMass (vision oriented, ICA, kernel PCA) -- mlrg
- Edinburgh (NLP group, Hoffman, Manning,Collins) -- mlrg
- U. of California, San Diego (Tishby, Lafferty, Joachims, Herbrich) -- mlrg
- UTexas (Aggarwal, Altun, Manning) -- mlrg

- Columbia (November) -- mlrg
- U.Toronto (July)-- mlrg
- Tufts University (December) -- mlrg
- Australian National U. (December)-- mlrg
- Centre Universitaire d'Informatique (June) --journal club
- Microsoft Research (Herbrich, large-margin, factor graphs, kernels) -- mlrg
- CWI Journal Club (Netherlands) journal club
- Seminars at NYU (Langford, Saul) seminars
- Graphical Models Reading Group at CMU (about 10 months old) mlrg

Who is being read?

Below is the list of most popular authors ranked by how many groups in the "current MLRGs" section had them listed:

9 - Koller

8 - Jordan

7 - Weiss

6 - Taskar, Saul, McCallum, Lafferty, Ghahramani, Friedman, Collins

5 - Roweis, Pereira, Ng, Domingos

4 - Zhu, Tishby, Russell, Roth, Li, Kearns, Jaakkola, Guestrin, Blei

3 - Zhou, Yair, Wong, Wellner, Tenenbaum, Slonim, Singh, Singer, Scholkopf, Schapire, Mooney, Mansour, Manning, MacKay, Joachims, Hofmann, Griffiths, Freund, Freeman, Dengyong, Breiman, Bartlett, Altun

Questions:

- What other reading groups have websites?
- What are other ways of finding interesting papers?

## Saturday, February 12, 2005

### Tensor Methods for Optimization

Description

Main idea to Schnabel's tensor method is to select a set of p points from previous evaluations such that they are mostly linearly independent (each new direction makes at least a 45 degree angle with old direction). The idea is then to try to interpolate function values at those points by fitting a 4th order model with special form of 3rd and 4th order terms. Because of the restricted form of these terms, the overhead associated with this step is not significant over the standard Newton-Gauss algorithm. Because it approximates function with 4th order expansion, the number of problems for which this method converges was larger. Schnabel also proves superlinear convergence of this method for some cases when Newton's method converges linearly.

Performance

Seth Greenblatt tried tensor method on some econometric models, and results were 1.5-3 times reduction in number of function evaluations. He compared it against Newton-Rhapson method, using exact matrix inversion (O(n^3)) and the extra overhead of for the tensor method wasn't significant. The problems he tested on were econometric models including Lucas-Rapping model and Berndt-Wood KLEM model. Lucas-Rapping model in his thesis was basically logistic regression with some econometric semantics.

Schnabel's paper in SIAM J Opt also used explicit Hessian, and O(n^3) steps to invert it. His 1998 paper introduces implementation of tensor method, TenSolve, which uses same approach as Gauss-Newton to approximate second order term from gradients, which means O(n^2) space and time requirement. Asymptotic overhead of using tensors was O(n^1.5)

Thoughts

TenSolve stores the approximation to the Hessian explicitly, so that takes O(n^2) space. It seems conceptually possible to reduce storage to O(n) if we use a method analogous to limited memory BFGS to store the Hessian. If that worked, it would result in a tensor optimization method with O(n) storage complexity and O(n^1.5) computation complexity

- What are some applications of Machine Learning where an O(n^1.5) method would be acceptable?
- What would be a good dataset on which to test CRF training using this method?

Resources:

- Schnabel, Robert B and Ta-Tung Chow (1991) "Tensor Methods for Unconstrained Optimization Using Second Derivatives," SAIM J. Optimization 1,3,293-315
- Ali Bouaricha, Robert B. Schnabel: Algorithm 768: TENSOLVE: A Software Package for Solving Systems of Nonlinear Equations and Nonlinear Least-Squares Problems Using Tensor Methods. ACM Trans. Math. Softw. 23 (2): 174-195 (1997)
- Greenblatt, "Tensor Methods for Full-Information Maximum Likelihood Estimation", PhD thesis, George Washington University
- TenSolve (Implementation of Schnabel's tensor optimization method): http://www.netlib.org/toms/768

## Friday, February 11, 2005

### Uninformative Priors

This prior seems to favour values of theta close to 0 or 1, so how can it be uninformative?

The intuition they provide is the following: think of Maximum Likelihood estimator as the most uninformative estimate of parameters. Then we can rank informativeness of priors by how closely the Bayesian Posterior Mean estimate comes to the maximum likelihood estimate.

If we observed k heads out of n tosses, we get following estimates:

1. MLE: k/n

2. PM using Jeffrey's prior (k+1/2)/(n+1)

3. Laplace smoothing (Uniform) (k+1)/(n+2)

4. PM using Beta(a,b) prior (k+a)/(n+a+b)

You can see that using this informal criteria for informativeness, Jeffreys' prior is more uninformative than Uniform, and Beta(0,0) (the limiting distribution) is the least informative of all, because then Posterior Mean estimate will coincide with the MLE.

Resources:

- Zhu, Lu, The Counter-intuitive Non-informative Prior for the Bernoulli Family, Journal of Statistics Education Volume 12, Number 2 (2004), http://www.amstat.org/publications/jse/v12n2/zhu.pdf
- Kass, Wasserman, "The Selection of Prior Distributions by Formal Rules", JASA 96
- Anne Randi Syversveen. "Noninformative Bayesian priors. Interpretation and problems with construction and applications" (1998), unpublished http://www.math.ntnu.no/preprint/statistics/1998/S3-1998.ps

## Wednesday, February 09, 2005

### Why I started this blog

- Comments on papers
- Mini-tutorials on ML-related topics
- Links to online Machine Learning resources

It won't be linear in nature as I'll revise old posts when relevant information comes up. Maybe in the future I'll switch to a setup similar to Cosma Shalizi's notebooks, but at this point, blog is the most practical thing to do.

- Cosma Shalizi's notebooks http://www.cscs.umich.edu/~crshalizi/notebooks/

## Monday, February 07, 2005

### Lagrange multipliers: equality constraints

- Finding lengths of best instantaneous code when distribution of source is known
- Finding maximum entropy distribution subject to given constraints
- Minimizing distortion specified by the amount of information X provides about Y (Tishby's, Information Bottleneck method)

However, you don't have to understand them to be able to go through the math. Here's how you can do it in 5 easy steps:

Write down the thing you are optimizing as J(x)

- Write down your constraints as g_i(x)=0
- Define L(x,l1,l2,...)=J(x) + l1 g1(x) + l2 g2(x) +...
- Differentiate L with respect to each variable and set to 0
- Solve this system of equations

Suppose you want to maximize

subject to

Your problem looks like this:

(black line is where you solution must lie)Then our only constraint function is g(x,y)=x-y+1=0. Our L (Langrangian) then becomes

differentiating and setting to 0 we get

Solve this system of 3 equations by substitution and you get x=-0.5, y=0.5, lambda=-1. The lagrange multiplier lambda_i represents ratio of the gradients of objective function J and i'th constraint function g_i at the solution point (that makes sense because they point in the same direction)

Bigger Example

For a more realistic example, consider maximizing entropy over a set of discrete distributions over positive integers where the first moment is fixed. Since we have an infinite number of possible X's, differentiating the Lagrangian gives us an countably infinite system of equations. But that's not actually a problem (unless you try to write them all down). Here's the detailed derivation

http://yaroslavvb.com/research/reports/fuzzy/fuzzy_lagrange.ps

So you can see that highest entropy discrete distribution with fixed mean is geometric distribution. We can solve the same problem for continuous X's in the positive domain, and get exponential distribution. If we allow X's to span whole reals, there will be no solution for the same reason as why "uniform distribution over all reals" doesn't exist.

Questions:

- What to use for complicated 3d graphs? I tried Mathematica, but not sure how to combine two 3d graphs into one
- Can anyone do Exercise 1. in Chapter 5 of Christianini's SVM book?
- If we use Robinson's non-standard Analysis, could we sensibly define "uniform distribution over all integers"?

Resources:

- Dan Klein's "Lagrange Multipliers without Permanent Scarring" http://www-diglib.stanford.edu/~klein/lagrange-multipliers.pdf
- Chapter 5 of N. Cristianini and J. Shawe-Taylor Support-Vector Machines book http://www.support-vector.net/chapter_5.html
- Chapter 5 of Stephen Boyd and Lieven Vandenberghe "Convex Optimization" http://www.stanford.edu/~boyd/cvxbook.html
- Bertsekas, Dimitri P. "Constrained optimization and Lagrange multiplier methods" ISBN: 0120934809

## Saturday, February 05, 2005

### Interesting blogs, newsgroups, chatrooms

## Technical:

- John Langford's blog -- http://hunch.net/ , high quality technical posts on Machine Learning
- Lance Fortnow's blog -- http://weblog.fortnow.com/ , Computational Complexity and related
- Notebooks -- http://www.cscs.umich.edu/~crshalizi/notebooks/ , a collection of resources from Cosma Shalizi
- Illinois Genetic Algorithms Laboratory (IlliGAL) -- http://illigal.blogspot.com/
- AI-complete -- http://ai-complete.blogspot.com/ low-traffic, but high quality AI blog
- Neurodudes -- http://www.neurodudes.com/ Neuro-biological standpoint
- Statistical Modeling, Causal Inference, and Social Science -- http://www.stat.columbia.edu/~cook/movabletype/mlm/

## non-Technical:

- AI Knowledge (http://www.karmachakra.com/aiknowledge) -- non-technical AI-related
- Ernie's 3D Pancakes -- http://3dpancakes.typepad.com/ernie/ Jeff is a "computational geometer"
- "Occasional thoughts by physicist Michael Nielsen" -- http://www.qinfo.org/people/nielsen/blog/
- Ferndando Pereira's blog -- http://radio.weblogs.com/0100167/
- Language log -- http://itre.cis.upenn.edu/~myl/languagelog/
- Hanna Wallach is doing a PhD in Machine Learning -- http://www.srcf.ucam.org/~hmw26/join-the-dots/
- The Alpha and Omega -- http://mahalanobis.twoday.net/ interesting snippets from an economist
- Thesilog -- http://www.sologen.net/thesilog/ Amir does research in control theory

## Newsgroups:

- comp.lang.ai
- comp.lang.ai.neural-nets
- sci.stat.math
- Machine Learning (http://groups-beta.google.com/group/Machine-Learning)

## Mailing Lists:

- Connectionists (calls for papers, jobs): http://www.cnbc.cmu.edu/Resources/connectionists.shtml
- COLT mailing list (more calls for papers): http://mail.cs.uiuc.edu/mailman/listinfo/colt
- UAI mailing list (if you can get on it): http://www.auai.org/

## Chatrooms:

- irc.freenode.org #MachineLearning

## Finding Blogs:

After looking at "Referring Link" stats of people browsing this blog, I discovered two new ways of finding relevant blogs- Go to http://blo.gs and search for "Machine Learning"
- Go to http://www.technorati.com/, search for Machine Learning related links, and it'll give you other blogs that have that link

Feel free to post other cool blogs, or other methods of finding them

## Thursday, February 03, 2005

### Getting probabilities from students

Suppose you give true/false test to students and want them to put down their subjective probabilities that answer is true/false. To make them honest, you would want to award points in such a way, that in order to maximize their expected points, the student would need to report their correct probabilities. Which utility function to use? It turns out, f(x)=A log x + B is one such family. Here's the derivation

Questions:

1) Does any other such utility function exist?

2) What are other fun ways to torture students?

Here are graphs of three possible functions that can be used for eliciting probabilities from students. They are all concave, maybe one can prove any suitable function must be concave?