So you know the Bayes rule. How does it relate to machine learning? It can be quite difficult to grasp how the puzzle pieces fit together – we know it took us a while. This article is an introduction we wish we had back then.
While we have some grasp on the matter, we’re not experts, so the following might contain inaccuracies or even outright errors. Feel free to point them out, either in the comments or privately.
This is el drafto. Come back later for la version final.
Bayesians and Frequentists
In essence, Bayesian means probabilistic. The specific term exists because there are two approaches to probability. Bayesians think of it as a measure of belief, so that probability is subjective and refers to the future.
Frequentists have a different view: they use probability to refer to past events – in this way it’s objective and doesn’t depend on one’s beliefs. The name comes from the method – for example: we tossed a coin 100 times, it came up heads 53 times, so the frequency/probability of heads is 0.53.
For a thorough investigation of this topic and more, refer to Jake VanderPlas’ Frequentism and Bayesianism series of articles.
Priors, updates, and posteriors
We start with a belief, called a prior. Then we obtain some data and use it to update our belief. The outcome is called a posterior. Should we obtain even more data, the old posterior becomes a new prior and the cycle repeats.
This process employ the Bayes rule :
P( A | B ) = P( B | A ) * P( A ) / P( B )
P( A | B ) , read as “probability of A given B”, indicates a conditional probability: how likely is A if B happens.
Inferring model parameters from data
In Bayesian machine learning we use the Bayes rule to infer model parameters (theta) from data (D):
P( theta | D ) = P( D | theta ) * P( theta ) / P( data )
All components of this are probability distributions.
P( data ) is something we generally cannot compute, but since it’s just a normalizing constant, it doesn’t matter that much. When comparing models, we’re mainly interested in expressions containing theta, because
P( data ) stays the same for each model.
P( theta ) is a prior, or our belief of what the model parameters might be. Most often our opinion in this matter is rather vague and if we have enough data, we simply don’t care that much. Inference should converge to probable theta as long as it’s not zero in the prior. One specifies a prior in terms of a parametrized distribution – see Where priors come from .
P( D | theta ) is called likelihood of data given model parameters. The formula for likelihood is model-specific. People often use likelihood for evaluation of models: a model that gives higher likelihood to real data is better.
P( theta | D ) , a posterior, is what we’re after. It’s a probability over model parameters, including most likely point estimates, obtained from prior beliefs and data.
Note that choosing a model can be seen as separate from choosing model (hyper)parameters. In practice, though, they are usually performed together, by validation, for example.
Spectrum of methods
There are two main flavours of Bayesian. Let’s call the first statistic modelling and the second probabilistic machine learning. The latter contains the so-called nonparametric approaches.
Bayesian modelling is applied when data is scarce and precious and hard to obtain, for example in social sciences and other settings where it is difficult to conduct a large-scale controlled experiment. Imagine a statistician meticulously constructing and tweaking a model using what little data he has. In this setting you spare no effort to make the best use of available input.
Also, with small data it is important to quantify uncertainty and that’s precisely what Bayesian approach is good at.
Finally, as we’ll see later, Bayesian methods are usually computationally costly. This again goes hand-in-hand with small data.
To get a taste, consider examples for the Data Analysis Using Regression Analysis and Multilevel/Hierarchical Models book. That’s a whole book on linear models. They start with a bang: a linear model with no predictors, then go through a number of linear models with one predictor, two predictors, six predictors, up to eleven.
This labor-intensive mode goes against a current trend in machine learning to use data for a computer to learn automatically from it.
Probabilistic machine learning
Let’s try replacing “Bayesian” with “probabilistic”. From this perspective, it doesn’t differ as much from other methods. As far as classification goes, most classifiers are able to output probabilistic predictions. Even SVMs, which are sort of an antithesis of Bayesian.
By the way, these probabilities are only statements of belief from a classifier. Whether they correspond to real probabilities is another matter completely and it’s calledcalibration.
Still another thing are confidence intervals (error bars). You can observe this in regression. Most “normal” methods only provide point estimates. Bayesian methods, such as Bayesian version of linear regression, or Gaussian processes, also provide uncertainty estimates.
Unfortunately, it’s not the end of the story. Even a sophisticated method like GP normally operates on an assumption of homoscedasticity, that is, uniform noise levels. In reality, noise might be heteroscedastic. See the image below.
Latent Dirichlet Allocation is another example of a method that one throws data at and allows it to sort it out. It’s similar to matrix factorization models, especially non-negative MF. You start with a matrix where rows are documents, columns are words and each element is a count of a given word in a given document. LDA “factorizes” this matrix of size n x d into two matrices, documents/topics ( n x k ) and topics/words ( k x d ).
The difference is, you can’t multiply those two to get the original, but since the appropriate rows/columns sum to one, you can sample a document. For the first word, one samples a topic, then a word from this topic (the second matrix). Repeat for the number of words you want. Notice that this is a bag-of-words representation, not a proper sequence.
This is an example of a generative model, meaning that one can sample, or generate examples, from that model. Usually classifiers are discriminative: they model
P( y | x ) , to directly discriminate between classes based on x . A generative model is concerned with joint distribution of y and x ,
P( y, x ) . It’s more difficult to estimate that distribution, but it allows sampling and of course one can get
P( y | x ) from
P( y, x ) .
While there’s no exact definition, the name means that the number of parameters in a model can grow as more data become available. This is similar to Support Vector Machines, for example, where the algorithm chooses support vectors from the training points. Examples of nonparametrics are Gaussian Processes, and Hierarchical Dirichlet Process version of LDA, where the number of topics chooses itself automatically.
Gaussian processes are somewhat similar to SVM – both use kernels and have similar scalability (which has been vastly improved throught the years by using approximations). A natural formulation for GP is regression, with classification as an afterthought. For SVM it’s the other way around. Another difference is that GP are probabilistic from the ground up (providing error bars), while SVM are not.
Most of the research on GP seems to happen in Europe. English have done some interesting work on making GP easier to use. One of the projects is the automated statistician by a team led by Zoubin Ghahramani.
A relatively popular application of Gaussian Processes is hyperparameter optimization for machine learning algorithms. The data is small, both in dimensionality – usually only a few parameters to tweak, and in the number of examples. Each example represents one run of the target algorithm, which might take hours or days. Therefore we’d like to get to the good stuff with as few examples as possible.
Model vs inference
Inference refers to how you learn parameters of your model. A model is separate from how you train it, especially in the Bayesian world.
Consider deep learning: you can train a network using Adam, RMSProp or a number of other optimizers. However, they tend to be rather similar to each other, all being variants of Stochastic Gradient Descent. In contrast, Bayesian methods of inference differ from each other more profoundly.
Variational inference is a method designed explicitly to trade some accuracy for speed. It’s drawback is that it’s model-specific, but there’s light at the end of the tunnel – see the section on software below.
The most conspicuous piece of Bayesian software these days is probably Stan . Stan is a probabilistic programming language, meaning that it allows you to specify and train whatever Bayesian models you want. It runs in Python, R and other languages. Stan has a modern sampler called NUTS :
Most of the computation [in Stan] is done using Hamiltonian Monte Carlo. HMC requires some tuning, so Matt Hoffman up and wrote a new algorithm, Nuts (the “No-U-Turn Sampler”) which optimizes HMC adaptively. In many settings, Nuts is actually more computationally efficient than the optimal static HMC!
One especially interesting thing about Stan is that it has automatic variational inference :
Variational inference is a scalable technique for approximate Bayesian inference. Deriving variational inference algorithms requires tedious model-specific calculations; this makes it difficult to automate. We propose an automatic variational inference algorithm, automatic differentiation variational inference (ADVI). The user only provides a Bayesian model and a dataset; nothing else.
This technique paves way to applying small-style modelling to at least medium-sized data.
In Python, the most popular package is PyMC . It is not as advanced or polished (the developers seem to be playing catch-up with Stan), but still good. PyMC has NUTS and ADVI – here’s a notebook with a minibatch ADVI example . The software uses Theano as a backend, so it’s faster than pure Python.
Infer.NET is Microsoft’s library for probabilistic programming. It’s mainly available from languages like C# and F#, but apparently can also be called from .NET’s IronPython. Infer.net uses expectation propagation by default.
Besides those, there’s a myriad of packages implementing various flavours of Bayesian computing, from other probabilistic programming languages to specialized LDA implementations. One interesting example is CrossCat :
CrossCat is a domain-general, Bayesian method for analyzing high-dimensional data tables. CrossCat estimates the full joint distribution over the variables in the table from the data, via approximate inference in a hierarchical, nonparametric Bayesian model, and provides efficient samplers for every conditional distribution. CrossCat combines strengths of nonparametric mixture modeling and Bayesian network structure learning: it can model any joint distribution given enough data by positing latent variables, but also discovers independencies between the observable variables.
To solidify your understanding, you might go through Radford Neal’s tutorial on Bayesian Methods for Machine Learning . It corresponds 1:1 to the subject of this post.
We found Kruschke’s Doing Bayesian Data Analysis , known as the puppy book, most readable. The author goes to great lengths to explain all the ins and outs of modelling. In terms of machine learning it only covers linear models.
Similarly, Cam Davidson-Pylon’s Probabilistic Programming & Bayesian Methods for Hackers covers the Bayesian part, but not the machine learning part.
The same goes to Alex Etz’ series of articles on understanding Bayes .
For those mathematically inclined, Machine Learning: a Probabilistic Perspective by Kevin Murphy might be a good book to check out. You like hardcore? No problemo, Bishop’s Pattern Recognition and Machine Learning got you covered. One recent Reddit thread briefly discusses these two books.
As far as we know, there’s no MOOC on Bayesian machine learning, but mathematicalmonk explains machine learning from the Bayesian perspective.