Unsupervised deep learning identifies semantic disentanglement in single inferotemporal face patch neurons


In order to better understand how the brain perceives faces, it is important to know what objective drives learning in the ventral visual stream. To answer this question, we model neural responses to faces in the macaque inferotemporal (IT) cortex with a deep self-supervised generative model, β-VAE, which disentangles sensory data into interpretable latent factors, such as gender or age. Our results demonstrate a strong correspondence between the generative factors discovered by β-VAE and those coded by single IT neurons, beyond that found for the baselines, including the handcrafted state-of-the-art model of face perception, the Active Appearance Model, and deep classifiers. Moreover, β-VAE is able to reconstruct novel face images using signals from just a handful of cells. Together our results imply that optimising the disentangling objective leads to representations that closely resemble those in the IT cortex at the single unit level. This points to disentangling as a plausible learning objective for the visual brain.

Authors' Notes

Our brain has an amazing ability to process visual information. We can take one glance at a complex scene, and within milliseconds be able to parse it into objects and their attributes, like colour or size, and use this information to describe the scene in simple language. Underlying this seemingly effortless ability is a complex computation performed by our visual cortex, which involves taking millions of neural impulses transmitted from the retina and transforming them into a more meaningful form that can be mapped to the simple language description. In order to fully understand how this process works in the brain, we need to figure out both how the semantically meaningful information is represented in the firing of neurons at the end of the visual processing hierarchy, and how such a representation may be learnt from largely untaught experience.

Figure 1. Disentangling refers to the ability of neural networks to discover semantically meaningful attributes of images without being explicitly taught what these attributes are. These models learn by mapping images into a lower-dimensional representation through an inference neural network, and trying to reconstruct the image using a generation neural network. Each individual latent unit in a disentangled representation learns to encode a single interpretable attribute, like colour or size of an object. Manipulating such latents one at a time results in interpretable changes in the generated image reconstruction. Animation credit Chris Burgess.

To answer these questions in the context of face perception, we joined forces with our collaborators at Caltech (Doris Tsao) and the Chinese Academy of Sciences (Le Chang). We chose faces because they are well studied in the neuroscience community and are often seen as a “microcosm of object recognition”. In particular, we wanted to compare the responses of single cortical neurons in the face patches at the end of the visual processing hierarchy, recorded by our collaborators, to a recently emerged class of so-called “disentangling” deep neural networks that, unlike the usual “black box” systems, explicitly aim to be interpretable to humans. A “disentangling” neural network learns to map complex images into a small number of internal neurons (called latent units), each one representing a single semantically meaningful attribute of the scene, like colour or size of an object (see Figure 1). Unlike the “black box” deep classifiers trained to recognise visual objects through a biologically unrealistic amount of external supervision, such disentangling models are trained without an external teaching signal using a self-supervised objective of reconstructing input images (generation in Figure 1) from their learnt latent representation (obtained through inference in Figure 1).

Disentangling was hypothesised to be important in the machine learning community almost ten years ago as an integral component for building more data-efficient, transferable, fair, and imaginative artificial intelligence systems. However, for years, building a model that could disentangle in practice eluded the field. The first model able to do this successfully and robustly, called β-VAE, was developed by taking inspiration from neuroscience: β-VAE learns by predicting its own inputs; it requires visual experience similar to that encountered by babies in order to learn successfully; and its learnt latent representation mirrors known properties of the visual brain.
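To make the training objective more concrete, here is a minimal sketch of the loss that β-VAE minimises: a reconstruction error plus a KL divergence term weighted by β. The squared-error reconstruction term, the default value of β, and the function and argument names are illustrative assumptions rather than details taken from our paper.

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """Batch-averaged beta-VAE objective: reconstruction error + beta * KL.

    x, x_recon  : (batch, n_pixels) original and reconstructed images
    mu, log_var : (batch, n_latents) mean and log-variance of the diagonal
                  Gaussian posterior produced by the inference network
    beta        : weight on the KL term (assumed value; beta = 1 gives a
                  standard VAE)
    """
    # Squared-error reconstruction term, summed over pixels
    # (an assumption: other likelihoods are equally possible).
    recon = np.sum((x - x_recon) ** 2, axis=1)
    # Closed-form KL divergence between N(mu, diag(exp(log_var))) and N(0, I).
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)
    return float(np.mean(recon + beta * kl))

# Toy usage with made-up numbers: 2 images of 4 pixels, 3 latent units.
x = np.array([[0.0, 0.5, 1.0, 0.5], [1.0, 1.0, 0.0, 0.0]])
x_recon = x + 0.1
mu = np.zeros((2, 3))
log_var = np.zeros((2, 3))
print(beta_vae_loss(x, x_recon, mu, log_var))
```

Raising β above 1 puts more pressure on the latent units to stay close to an independent unit Gaussian prior, which is the pressure generally credited with encouraging disentangled latents.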

In our new paper, we measured the extent to which the disentangled units discovered by a β-VAE trained on a dataset of face images are similar to the responses of single neurons at the end of the visual processing hierarchy, recorded in primates looking at the same faces. The neural data was collected by our collaborators under rigorous oversight from the Caltech Institutional Animal Care and Use Committee. When we made the comparison, we found something surprising: the handful of disentangled units discovered by β-VAE behaved as if they were equivalent to a similarly sized subset of the real neurons. When we looked closer, we found a strong one-to-one mapping between the real neurons and the artificial ones (see Figure 2). This mapping was much stronger than that for alternative models, including the deep classifiers previously considered to be state-of-the-art computational models of visual processing, and a hand-crafted model of face perception seen as the “gold standard” in the neuroscience community. Moreover, the β-VAE units encoded semantically meaningful information like age, gender, eye size, or the presence of a smile, enabling us to understand what attributes single neurons in the brain use to represent faces.

Figure 2. Single neurons in the primate face patches at the end of the visual processing hierarchy represent interpretable face attributes, like eye shape or the presence of a smile, and are equivalent to single artificial neurons in β-VAE discovered through disentangled representation learning. Image credit Marta Garnelo.
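One simple way to picture a one-to-one mapping between latent units and neurons is sketched below. This post does not describe how the correspondence was actually measured, so the approach here (absolute Pearson correlation followed by greedy pairing) is an assumption chosen for illustration, and `match_units` is a hypothetical helper name.

```python
import numpy as np

def match_units(neurons, latents):
    """Pair latent units with neurons by repeatedly taking the strongest
    remaining absolute correlation.

    neurons : (n_images, n_neurons) responses to a set of face images
    latents : (n_images, n_latents) latent values for the same images
    returns : list of (latent_index, neuron_index, abs_correlation)
    """
    n_lat = latents.shape[1]
    # With rowvar=False each column is a variable; the top-right block of the
    # joint correlation matrix holds the latent-by-neuron correlations.
    full = np.corrcoef(latents, neurons, rowvar=False)
    corr = np.abs(full[:n_lat, n_lat:])

    available = corr.copy()
    pairs = []
    for _ in range(min(corr.shape)):
        i, j = np.unravel_index(np.argmax(available), available.shape)
        pairs.append((int(i), int(j), float(corr[i, j])))
        # Each latent unit and each neuron can be used only once.
        available[i, :] = -1.0
        available[:, j] = -1.0
    return pairs

# Synthetic demo: 3 latent units hidden among 5 simulated neurons.
rng = np.random.default_rng(0)
latents = rng.normal(size=(200, 3))
neurons = rng.normal(size=(200, 5))
neurons[:, 4] = latents[:, 0] + 0.01 * rng.normal(size=200)
neurons[:, 1] = -latents[:, 1] + 0.01 * rng.normal(size=200)
for latent_idx, neuron_idx, r in match_units(neurons, latents):
    print(latent_idx, neuron_idx, round(r, 3))
```

Greedy pairing is used only to keep the sketch short. It does not guarantee the best overall assignment, and a neuron that mixes several attributes would show weak correlations with every latent unit.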

If β-VAE was indeed able to automatically discover artificial latent units that are equivalent to the real neurons in terms of how they respond to face images, then it should be possible to translate the activity of real neurons into their matched artificial counterparts, and use the generator (see Figure 1) of the trained β-VAE to visualise what faces the real neurons are representing. To test this, we presented the primates with new face images that the model had never experienced, and checked if we could render them using the β-VAE generator (see Figure 3). We found that this was indeed possible. Using the activity of as few as 12 neurons, we were able to generate face images that were more accurate reconstructions of the originals and of better visual quality than those produced by the alternative deep generative models. This is despite the fact that the alternative models are known to be better image generators than β-VAE in general.

Figure 3. Face images were accurately reconstructed by the trained β-VAE generator from the activity of 12 one-to-one matched neurons in the primate visual cortex as the primates were viewing novel faces. Novel face images reproduced with permission from Ma et al. and Phillips et al.
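The decoding idea behind Figure 3 can be sketched in a few lines: translate each matched neuron's activity into the value of its latent unit, then pass the resulting latent vector to the generator. The per-pair linear map and the `toy_generator` stand-in below are assumptions for illustration only. The real generator is the trained β-VAE decoder, and this post does not specify how neural activity was rescaled into latent values.

```python
import numpy as np

def fit_pairwise_maps(neuron_train, latent_train):
    """Fit one line per matched pair: latent_k ~ slope_k * neuron_k + intercept_k.

    Both inputs have shape (n_images, n_pairs); column k of each array
    belongs to the same matched neuron/latent pair.
    """
    coeffs = [np.polyfit(neuron_train[:, k], latent_train[:, k], deg=1)
              for k in range(neuron_train.shape[1])]
    return np.array(coeffs)  # shape (n_pairs, 2): slope, intercept

def neurons_to_latents(neuron_test, coeffs):
    """Translate neural responses to novel images into latent values."""
    return neuron_test * coeffs[:, 0] + coeffs[:, 1]

def toy_generator(z, weights):
    """Hypothetical stand-in for the trained generator: linear map + sigmoid."""
    return 1.0 / (1.0 + np.exp(-(z @ weights)))

# Synthetic demo with 12 matched pairs, echoing the 12 neurons in the text.
rng = np.random.default_rng(1)
latent_train = rng.normal(size=(100, 12))
gains = rng.uniform(0.5, 2.0, size=12)
neuron_train = gains * latent_train + 3.0      # simulated firing rates
coeffs = fit_pairwise_maps(neuron_train, latent_train)

latent_novel = rng.normal(size=(5, 12))        # latents of held-out faces
neuron_novel = gains * latent_novel + 3.0
z_hat = neurons_to_latents(neuron_novel, coeffs)
images = toy_generator(z_hat, rng.normal(size=(12, 64)))
print(images.shape, float(np.max(np.abs(z_hat - latent_novel))))
```

Because each neuron feeds exactly one latent unit, the quality of the reconstruction in this scheme depends directly on how clean the one-to-one matching is.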

Our findings, summarised in the new paper, suggest that the visual brain can be understood at a single-neuron level, even at the end of its processing hierarchy. This is contrary to the common belief that semantically meaningful information is multiplexed between a large number of such neurons, each one remaining largely uninterpretable individually, not unlike how information is encoded across full layers of artificial neurons in deep classifiers. Our findings also suggest that the brain may learn to support our effortless visual perception by optimising the disentanglement objective. While β-VAE was originally developed with inspiration from high-level neuroscience principles, the utility of disentangled representations for intelligent behaviour has so far been primarily demonstrated in the machine learning community. In line with the rich history of mutually beneficial interactions between neuroscience and machine learning, we hope that the latest insights from machine learning may now feed back to the neuroscience community to investigate the merit of disentangled representations for supporting intelligence in biological systems, in particular as the basis for abstract reasoning, or generalisable and efficient task learning.