The Case for Cortical Computation as MCMC

Epistemic status: Exploratory; low confidence. These are just a few informal notes on an idea whose plausibility surprised me.

Computational neuroscientists don’t agree on much, but there’s a common view among them that the cortex does some form of probabilistic inference.

When you dart your eyes across the room, your brain updates what you see within milliseconds. This feat raises a question: What method of probabilistic inference does the brain use to achieve these rapid updates of what you see? Variational inference? Belief propagation? Stein variational gradient descent? Probabilistic population codes? Something else?

Here I’m going to walk through the arguments that the cortex might use Markov chain Monte Carlo (MCMC). It’s an idea that, at first glance, seemed to me totally wrong; I shared the widespread view that MCMC is too slow to perform such rapid probabilistic updates. Yet over time, I came round to the view that it’s actually moderately plausible. I’m still not confident that the brain really implements MCMC; this article just contains the essential arguments for why I no longer think it’s an obviously wrong idea.

The slow, slow Metropolis-Hastings algorithm

The first MCMC algorithm was the Metropolis-Hastings algorithm. The essential idea of the Metropolis-Hastings algorithm (and indeed of all MCMC algorithms) is that if we want to sample from a probability distribution in an \(N\)-dimensional space, we start with a ‘particle’ at a random location in the space and then propose a new, nearby position for the particle in a random direction. Then we accept or reject the proposed position; crucially, we’re more likely to accept the proposal if the new position has higher probability than the current position. The result is that the particle randomly diffuses around the \(N\)-dimensional space with a slight bias toward high-probability regions. It can be shown that, after a large number of timesteps, the particle visits each point in the space in proportion to the probability density that we’re trying to sample from.
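In code, the whole recipe fits in a few lines. Here’s a minimal sketch (the function, its arguments, and the Gaussian example are my own illustrative choices, not from any particular library):

```python
import numpy as np

def metropolis_hastings(energy, n_dims, n_steps, step_size=0.1, seed=0):
    """Random-walk Metropolis-Hastings on the unnormalised density exp(-energy(x))."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_dims)                   # random starting location
    samples = []
    for _ in range(n_steps):
        proposal = x + step_size * rng.standard_normal(n_dims)  # nearby random move
        # Accept with probability min(1, p(proposal) / p(x)) = min(1, exp(E(x) - E(proposal))),
        # so moves to higher-probability positions are more likely to be accepted.
        if rng.random() < np.exp(energy(x) - energy(proposal)):
            x = proposal
        samples.append(x.copy())
    return np.array(samples)

# Example: sample from a standard Gaussian, whose energy is ||x||^2 / 2.
samples = metropolis_hastings(lambda x: 0.5 * x @ x, n_dims=2, n_steps=5_000)
```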

Let’s not mince words here: the Metropolis-Hastings algorithm is god-awfully slow, especially in high dimensions. This is because in high-dimensional spaces almost every direction is orthogonal to the directions of increasing probability density! So most proposals are rejected, and diffusion toward higher-probability regions of the space is prohibitively slow.

[Figure: me waiting on my MCMC runs]

No one seriously believes that the cortex could be using a version of the Metropolis-Hastings algorithm. Fortunately, there are far more efficient MCMC algorithms available. We’ll consider two classes here: Langevin MCMC and Hamiltonian Monte Carlo.

Langevin MCMC: Go with the Gradient

We can immediately do better than the Metropolis-Hastings algorithm if we have access to the gradient of the probability density function: if we know the gradient, we just move the particle \(x_t\) in the direction of the gradient so that it heads toward high-probability regions of the space. We also need to add the right amount of noise so that the particle doesn’t simply settle at the highest-probability point of the distribution, but instead settles into an equilibrium distribution. This approach is broadly called Langevin MCMC. It’s well suited to high-dimensional densities since far more of its proposals are in the direction of higher probability, and hence far fewer proposals are rejected. It thus overcomes the slow exploration of the Metropolis-Hastings algorithm. In Langevin MCMC, the dynamics of a particle \(x\) as it moves around the \(N\)-dimensional space can be described by the following update:

\[x_{t+1} \leftarrow x_t - \frac{\epsilon}{2}\nabla_{x_t} E(x_t) + \eta_t \qquad \eta_t \sim \mathcal{N}(0,\epsilon)\]

where \(E\) is the energy function for that distribution. An energy function is just a function that takes a point in an \(N\)-dimensional space as input and spits out a single number that says how probable its input is (lower energy means higher probability). It’s therefore an un-normalised description of a probability distribution, where the normalised distribution is:

\[p(x)=\frac{\exp(-E(x))}{\int \exp(-E(x'))\, dx'}\]

If you only have an energy function, MCMC methods are handy because they allow you to draw samples from the distribution without knowing the ugly intractable integral in the denominator (the ‘normalising constant’).
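As a sketch, here’s the update above in Python. Note that this is the ‘unadjusted’ variant, which skips the Metropolis-Hastings accept/reject correction; the step size and Gaussian example are again my own illustrative choices:

```python
import numpy as np

def langevin_mcmc(grad_energy, n_dims, n_steps, eps=0.01, seed=0):
    """Langevin dynamics: drift down the energy gradient, plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_dims)                       # random starting location
    samples = []
    for _ in range(n_steps):
        noise = np.sqrt(eps) * rng.standard_normal(n_dims)  # eta_t ~ N(0, eps)
        x = x - 0.5 * eps * grad_energy(x) + noise           # the update above
        samples.append(x.copy())
    return np.array(samples)

# Standard Gaussian: E(x) = ||x||^2 / 2, so grad E(x) = x. Works fine in 100 dimensions.
samples = langevin_mcmc(lambda x: x, n_dims=100, n_steps=5_000)
```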

This brings us back to cortex: One possibility is that the cortex implements an energy function. How could the cortex represent an energy function and how could it possibly have access to its gradient?

(Some) Recurrent Neural Networks implement Langevin MCMC

‘Sampling models’ of neural activity usually begin by representing the instantaneous firing rates of \(N\) neurons as a point particle \(x_t\) in an \(N\)-dimensional space. If those neurons are connected together as a recurrent neural network (RNN), the particle’s position (neural activity) at one timestep determines the position at the next timestep (i.e. it has the Markov property). On small enough timescales, the particle makes only small movements through neural activity-space, because cortical activity changes continuously in time. If we also accept the common view that cortical dynamics are noisy, these dynamics already begin to sound a lot like MCMC.

Bengio and Fischer (2015) formalised a possible link between certain recurrent neural networks and Langevin MCMC. They considered energy functions with a particular form:

\[E(x)=\frac{1}{2} x^{\top}x - \frac{1}{2}\rho(x)^{\top}W\rho(x) - b^{\top} \rho(x)\]

where \(x\) is a vector of neural firing rates, \(\rho\) is a hard-sigmoid activation function (applied elementwise), \(W\) is a symmetric matrix of connectivity weights, and \(b\) is a vector of biases for each neuron.

Consider sampling from this energy function using Langevin dynamics. The dynamics are*

\[x_{t+1} \leftarrow x_t - \frac{\epsilon}{2}\nabla_{x_t} E(x_t) + \eta_t \qquad \eta_t \sim \mathcal{N}(0,\epsilon)\]

Substituting the gradient of the energy above, \(\nabla_{x} E(x) = x - \rho'(x)\odot\left(W\rho(x)+b\right)\), where \(\odot\) denotes elementwise multiplication, this becomes:

\[x_{t+1} \leftarrow x_t - \frac{\epsilon}{2}\left(x_t - \rho'(x_t)\odot\left(W\rho(x_t)+b\right)\right) + \eta_t \qquad \eta_t \sim \mathcal{N}(0,\epsilon)\]

(*See Bengio and Fischer (2015) for a fuller derivation. They actually arrive at a slightly different noise term, which I omit here for simplicity, but the difference is negligible for small \(\epsilon\).)

Intuitively, this looks much like the dynamics of a vanilla RNN: the weight matrix multiplied by the activity vector means that the dynamics of an individual neuron are determined by a weighted sum of its input neurons (plus a bias term). We can condition the neural dynamics of such an RNN on external inputs by simply clamping a subset of the neurons (the input neurons) to whatever the inputs are. Conditioning the network on external inputs means that each input implies a different (conditional) probability distribution: new inputs can cause the particle to move to parts of the space that are ‘high probability’ for that particular input. For example, if a trained network is conditioned on a picture of a cat, the network would assign high probability to neural activity vectors in which the neurons corresponding to ‘cat’ representations are active.

We’ve ended up with a simple RNN model, starting from an energy function and Langevin dynamics. Notice how, in this model, the weighted interaction terms determine the dependencies between variables, and the bias terms determine the baseline level the neurons would decay to in the absence of inputs. Interestingly, this model (and other MCMC models of neural dynamics) implies that the noise observed ubiquitously in cortex has a function, rather than merely being a nuisance.
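To make the correspondence concrete, here’s a minimal sketch of these dynamics in Python, using the energy function above. The network size, weight scale, and step size are arbitrary illustrative choices, and clamping of input neurons is left out for brevity:

```python
import numpy as np

def rho(x):
    """Hard-sigmoid activation, applied elementwise."""
    return np.clip(x, 0.0, 1.0)

def rho_prime(x):
    """(Sub)derivative of the hard sigmoid: 1 inside (0, 1), 0 outside."""
    return ((x > 0.0) & (x < 1.0)).astype(float)

def rnn_langevin_step(x, W, b, eps, rng):
    """One noisy RNN update; equivalently, one Langevin step on E(x)."""
    grad_E = x - rho_prime(x) * (W @ rho(x) + b)   # gradient of the energy above
    return x - 0.5 * eps * grad_E + np.sqrt(eps) * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
n = 50                                  # number of neurons (arbitrary)
W = 0.1 * rng.standard_normal((n, n))
W = 0.5 * (W + W.T)                     # the energy function assumes symmetric weights
np.fill_diagonal(W, 0.0)                # ...and no self-connections (a common assumption)
b = np.zeros(n)

x = rng.standard_normal(n)              # random initial firing rates
for _ in range(1_000):                  # unrolling the RNN = running the Markov chain
    x = rnn_langevin_step(x, W, b, eps=0.01, rng=rng)
```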

Implicit vs. explicit energy functions

We’ve just seen that Langevin sampling from a particular energy function implies an RNN whose dynamics visit high-probability states more often than low-probability states. But although we derived these dynamics from an energy function, we didn’t need to start with the energy function to derive the dynamics. We could have started with the RNN dynamics and arrived at the energy function. In fact (and this is crucial), we need not arrive at the energy function explicitly and analytically: we can just take it for granted that RNNs with certain properties (e.g. being guaranteed, or likely, to end up in a stable attractor state) imply a certain energy function, even though the actual energy function might be too difficult to derive analytically. Such an RNN would therefore imply a probability distribution, and its noisy dynamics would implicitly correspond to Langevin MCMC. This is encouraging, but it is far from a compelling account that the cortex actually is such a model. The example here was merely meant to show how Langevin sampling of a probability distribution could be implemented by the unrolling of a simple, noisy RNN.

Even faster sampling with gamma oscillations as Hamiltonian MCMC?

The brain is obviously not a simple RNN, but a very complex one, and Langevin sampling dynamics probably can’t account for the many complexities of its dynamics. One such complexity is gamma oscillations: oscillations of cortical activity in the roughly 40-60 Hz frequency range. Although it is unclear how vanilla Langevin dynamics could account for this phenomenon, Máté Lengyel, Laurence Aitchison, and others have proposed that gamma oscillations may in fact implement a similar, yet even more efficient, type of sampling: Hamiltonian MCMC.

In a nutshell, if Langevin sampling is like gradient descent, then Hamiltonian sampling is like gradient descent with momentum. Given a random initial location, momentum allows an \(N\)-dimensional particle to move to high-probability regions of the distribution much more quickly than even Langevin methods. The connection to gamma oscillations is this: a marble rolling around a cereal bowl doesn’t stop at the bottom; it overshoots the bottom due to its momentum. In the same way, as a particle with momentum ‘rolls’ around valleys in the high-dimensional energy landscape, it too exhibits oscillations. In this computational model of gamma oscillations, momentum in a particular direction corresponds to low activity of an inhibitory neuron, and the position of the particle corresponds to the activity of excitatory neurons. The key idea is that an inhibitory neuron with low activity permits the activity of the excitatory neurons it connects to to increase quickly; the neural activity vector then moves quickly in the uninhibited direction of neural activity-space, which is analogous to momentum in that direction.
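For intuition, here’s a sketch of a momentum-based sampler - a discretised ‘underdamped’ Langevin scheme, which is the momentum-augmented relative of full Hamiltonian Monte Carlo (the friction and step-size values are my own illustrative choices, not from the papers below). On a simple Gaussian bowl, the trace of \(x\) visibly oscillates as it settles, like the marble:

```python
import numpy as np

def momentum_langevin(grad_energy, n_dims, n_steps, eps=0.05, gamma=0.3, seed=0):
    """Euler-Maruyama discretisation of underdamped Langevin dynamics (unit mass)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_dims)      # position: the 'excitatory' activity
    m = np.zeros(n_dims)                 # momentum: the auxiliary 'inhibitory' variable
    trace = []
    for _ in range(n_steps):
        # Momentum accumulates the downhill force, with friction and fresh noise...
        m = m - eps * (grad_energy(x) + gamma * m) \
            + np.sqrt(2.0 * gamma * eps) * rng.standard_normal(n_dims)
        # ...and the position coasts along it, overshooting minima like the marble.
        x = x + eps * m
        trace.append(x.copy())
    return np.array(trace)

# Gaussian bowl: grad E(x) = x. Plotting trace[:, 0] shows damped oscillations.
trace = momentum_langevin(lambda x: x, n_dims=1, n_steps=500)
```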

Across several papers, their Hamiltonian sampling models of gamma oscillations have offered an interesting normative theory for several phenomena observed in cortex, such as contrast-dependent oscillation frequency (the observation that higher-contrast stimuli lead to higher-frequency gamma oscillations (Ray and Maunsell (2010))) and contrast-dependent transient overshoots (the observation that the amplitude of transient activity in response to a stimulus increases with increasing stimulus contrast). It’s really nice theoretical work; I recommend their papers: Aitchison and Lengyel (2016) and Echeveste et al. (2020).

A brute-force solution to the slowness of MCMC?

Despite hints of validity, the perspective that cortical dynamics implement some form of MCMC has a long way to go before the evidence compels us to accept it.

A reasonable reader might still object to the idea that the cortex is doing MCMC because even Hamiltonian MCMC is just so much slower than many other methods of probabilistic sampling. I know this as well as anyone: in my own experiments with networks that implemented Hamiltonian sampling, the networks took over a month and a half to train, and training was still incomplete!

Nevertheless, this objection no longer carries as much weight for me as it used to: if MCMC can be implemented by the unrolling of an RNN, then specialised hardware (such as biological nervous tissue) could speed the process up terrifically. Consider unrolling a simple digital RNN with 1 million neurons: that implies a connectivity matrix with 1 million \(\times\) 1 million parameters. We’re already talking about a model with a trillion parameters, and vertebrate brains typically have many more than 1 million neurons. Multiplying this matrix by the activity vector quickly becomes prohibitively costly; the operation scales quadratically with the number of neurons on digital hardware.
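A quick back-of-the-envelope calculation makes the point (assuming dense 32-bit weights; the numbers are rough):

```python
n_neurons = 1_000_000
n_params = n_neurons ** 2          # dense connectivity matrix: 10^12 weights
storage_tb = n_params * 4 / 1e12   # ~4 TB just to store W as 32-bit floats
flops_per_step = 2 * n_params      # one matrix-vector multiply: ~2e12 FLOPs

print(f"{n_params:.0e} parameters, ~{storage_tb:.0f} TB of weights, "
      f"{flops_per_step:.0e} FLOPs per unrolling step")
```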

But biological RNNs are specialised hardware: to perform the same operation (multiplying the connectivity matrix by the activity vector), we just let a biological RNN ‘do its thing’ and unroll; the physics of the hardware makes unrolling the cortex-RNN a low-cost computation. It can do the equivalent of ‘multiplying the connectivity matrix by the activity vector’ continuously, many times a second, for arbitrarily large RNNs. Correspondingly, even though the cortex contains many neurons and therefore should take a very long time to reach a stable attractor, there is evidence that sensory cortex finds unique stable attractors in high-dimensional neural state space very quickly - apparently in less than a second in the case of your visual cortex as it updates what you see. So, sure, there are sampling methods out there that are more efficient than MCMC in theory. But in practice, it might just be easy to implement MCMC using biological neurons. And if so, it would hardly be the first time that Nature brute-forced a solution to a problem.

Conclusion

To summarise everything so far: some RNNs imply an energy function of a probability distribution, and their noisy dynamics implicitly sample from that distribution; the excitatory-inhibitory dynamics of gamma oscillations may implement even more efficient sampling; and, while MCMC is a less efficient sampling method than some others, it’s plausible that the cortex implements a brute-force solution to the problem, taking advantage of specially adapted hardware.

These arguments have moved me in the direction of thinking that cortical computation might do something like MCMC, but I don’t think it’s the Grand Unified Theory of cortex (and, as far as I know, no one else does either). I currently think the most reasonable objection is this: while this is all excellent theory, the cortex may use too many other computational strategies for every detail to be related back to MCMC. And that’s okay. MCMC doesn’t have to explain everything about the cortex; it’s just a theoretical lens that brings several aspects of it under one explanation. I think that’s the most we can hope for in computational neuroscience.

Written on September 14, 2021