When prior knowledge is given in the form of a prior, the Bayesian method is the rational way of making decisions, based on Cox's arguments. However, prior knowledge doesn't usually come in the form of a prior. For instance, an expert may say that X and Y are independent, or that increasing the value of X will tend to increase Y. We need to figure out how to use those kinds of knowledge to bias our learners, and this is where Knowledge Intensive Learning comes in. Eric Altendorf does research in the area and is doing a series of posts on it in his new blog.
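To make the "increasing X will increase Y" case concrete, here is a minimal sketch of one way such a statement could be turned into a prior. Everything in it is an assumption made for illustration: the linear model, the half-normal prior on the slope, the grid approximation, the function name `posterior_mean_slope` and the toy data. It is not taken from Altendorf's work.

```python
import math

def posterior_mean_slope(xs, ys, noise_sd=1.0, prior_sd=1.0, monotone=True):
    """Grid-approximate the posterior mean of the slope b in y = b*x + noise.

    With monotone=True the prior is a half-normal on b >= 0, which is one
    (assumed) way to encode "increasing X tends to increase Y".
    With monotone=False the prior is an ordinary zero-mean normal.
    """
    grid = [i / 100.0 for i in range(-300, 301)]  # candidate slopes in [-3, 3]
    weights = []
    for b in grid:
        if monotone and b < 0:
            weights.append(0.0)  # the prior gives negative slopes zero mass
            continue
        log_prior = -0.5 * (b / prior_sd) ** 2
        log_lik = sum(-0.5 * ((y - b * x) / noise_sd) ** 2
                      for x, y in zip(xs, ys))
        weights.append(math.exp(log_prior + log_lik))
    total = sum(weights)
    return sum(b * w for b, w in zip(grid, weights)) / total

# Made-up data whose least-squares slope is slightly negative (-0.2).
xs = [-1.0, 0.0, 1.0]
ys = [0.3, 0.0, -0.1]

unconstrained = posterior_mean_slope(xs, ys, monotone=False)
constrained = posterior_mean_slope(xs, ys, monotone=True)
print(unconstrained, constrained)
```

With so little data, the unconstrained estimate follows the noise and comes out negative, while the monotone prior keeps the estimate non-negative. The hard part discussed in this post, deciding that a half-normal on a linear slope is the right translation of what the expert said, is exactly what the sketch takes for granted.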

## 7 comments:

You can represent background knowledge that relates to the model structure via hyperpriors.

If you can construct those priors, then the Bayesian approach is the way to go... but how do you construct those hyperpriors to start with? I.e., suppose you are a programmer who knows nothing about the data domain. You are supposed to program the prior, and are able to ask the domain expert any questions you want. That's a realistic situation, because machine learning experts are usually not experts in the biological/medical sciences where the datasets often come from.

So the following issues arise:

1. Which questions do you ask?

2. How do you go from question-answer pairs to a real valued function over the space of probability distributions?

I think the second one is especially problematic. We can elicit human beliefs over a discrete set in a principled way using a betting strategy, like Tversky & Kahneman did. However, what approach do you take when your space is uncountable, or even, in the case of continuous data domains, uncountable and infinite-dimensional?

Exactly. The hard bit of almost every successful machine learning application is the mapping from high-level domain knowledge to the nuts and bolts of feature selection, parameterization, prior selection, etc. That's where most of the work lies, and also where the key to success usually lies.

Consider an expert statement like "The virus incubates for about 2 weeks before the mosquito becomes infectious, at which point it will start to bite people, though keep in mind it only feeds maybe once a day, and it prefers small birds." How do you go from that to a set of features and priors? We want to make that task easier, since a lot of the time it's so hard that a lot of domain knowledge is ignored.
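As a small illustration of just the first clause ("incubates for about 2 weeks"), here is a sketch of turning it into a prior by moment matching. The Gamma family, the reading of "about 2 weeks" as a mean of 14 days, and especially the standard deviation of 3 days are assumptions made here; the expert statement supplies none of them, and the helper name `gamma_from_mean_sd` is hypothetical.

```python
def gamma_from_mean_sd(mean, sd):
    """Return (shape, rate) of the Gamma distribution with this mean and sd.

    Uses the identities mean = shape / rate and variance = shape / rate**2.
    """
    shape = (mean / sd) ** 2
    rate = mean / sd ** 2
    return shape, rate

# Assumed reading of "about 2 weeks": mean 14 days, sd 3 days (the sd is a guess).
shape, rate = gamma_from_mean_sd(14.0, 3.0)
print(shape, rate)
```

Even this easy case needed two choices the expert never made (the family and the spread), and the rest of the statement, such as feeding frequency and host preference, has no equally obvious translation.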

Ooops. Sorry, my example wasn't quite clear. I meant to say the mosquito "prefers to bite small birds rather than humans when possible". This is a real example; you're safer when the targets per mosquito ratio is high, and even more so when the ratio of small bird targets to human targets is high.

Choosing priors and hyperpriors is all the Bayesian flavor of machine learning is about. It's all meta-learning in the sense that we try to characterize a particular bunch of data sets, and hope that the same methods will apply to other data sets. Good/widely-applicable methods get to stay around and evolve, narrow or inferior methods wilt away and perish.

If you take comfort in mathematics, you can assume what the data is and then prove theorems. But that's just for comfort, and it doesn't comfort everyone.

IMHO, the point of proving theorems about learning methods is *not* to show that they will work well in practice (only experiments can do that). The point of proving theorems is that they help you develop good intuition about what might work.

For instance, COLT-92 saw the introduction of SVM, a practical method, which grew out of mathematical theorems proven by Vapnik and Chervonenkis.

Proving math theorems allows one to be rigorous about one's intuition. It also allows one to share that intuition with others. This is one reason why there's mild contempt for "ad hockery" -- if one doesn't adequately explain the reasons behind design decisions (i.e., by invoking some theoretical framework), people can't extend the method.
