Predicting gene expression with AI

Based on Transformers, our new Enformer architecture advances genetic research by improving the ability to predict how DNA sequence influences gene expression.

When the Human Genome Project succeeded in mapping the DNA sequence of the human genome, the international research community were excited by the opportunity to better understand the genetic instructions that influence human health and development. DNA carries the genetic information that determines everything from eye colour to susceptibility to certain diseases and disorders. The roughly 20,000 sections of DNA in the human body known as genes contain instructions about the amino acid sequence of proteins, which perform numerous essential functions in our cells. Yet these genes make up less than 2% of the genome. The remaining base pairs — which account for 98% of the 3 billion “letters” in the genome — are called “non-coding” and contain less well-understood instructions about when and where genes should be produced or expressed in the human body. At DeepMind, we believe that AI can unlock a deeper understanding of such complex domains, accelerating scientific progress and offering potential benefits to human health.

Today Nature Methods published “Effective gene expression prediction from sequence by integrating long-range interactions” (first shared as a preprint on bioRxiv), in which we, in collaboration with our Alphabet colleagues at Calico, introduce a neural network architecture called Enformer that led to greatly increased accuracy in predicting gene expression from DNA sequence. To advance further study of gene regulation and causal factors in diseases, we also made our model and its initial predictions of common genetic variants openly available.

Previous work on gene expression has typically used convolutional neural networks as fundamental building blocks, but their limitations in modelling the influence of distal enhancers on gene expression have hindered their accuracy and application. Our initial explorations relied on Basenji2, which could predict regulatory activity from relatively long DNA sequences of 40,000 base pairs. Motivated by this work and the knowledge that regulatory DNA elements can influence expression at greater distances, we saw the need for a fundamental architectural change to capture long sequences.

We developed a new model based on Transformers, common in natural language processing, to make use of self-attention mechanisms that could integrate much greater DNA context. Because Transformers are well suited to long passages of text, we adapted them to “read” vastly extended DNA sequences. By effectively processing sequences to consider interactions at distances of 200,000 base pairs, more than 5 times the length handled by previous methods, our architecture can model the influence of important regulatory elements called enhancers on gene expression from further away within the DNA sequence.
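To make the idea of self-attention concrete, here is a minimal single-head sketch in plain Python. It is not the Enformer implementation: the toy sequence, the two-dimensional vectors and the function names are all assumptions chosen for illustration. It only shows that attention scores every position against every other position, so a distant element can receive as much weight as a neighbouring one.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over a list of positions.

    Each argument holds one vector (a list of floats) per position.
    Returns the attended output vectors and the attention weight matrix.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        # Every position is scored against every other one, near or far.
        row = softmax([dot(q, k) / scale for k in keys])
        weights.append(row)
        outputs.append([
            sum(w * v[d] for w, v in zip(row, values))
            for d in range(len(values[0]))
        ])
    return outputs, weights

# Toy sequence of 6 positions (hypothetical values): position 0 plays the
# "gene" and position 5 a distant "enhancer" whose key matches its query.
queries = [[4.0, 0.0]] + [[0.0, 0.0]] * 5
keys = [[0.0, 1.0]] * 5 + [[4.0, 0.0]]
values = [[float(i)] for i in range(6)]

outputs, weights = self_attention(queries, keys, values)
print([round(w, 3) for w in weights[0]])
```

In this toy case almost all of the weight for position 0 lands on position 5, the furthest position in the sequence, which is the behaviour a convolution with a small receptive field cannot reproduce in a single layer.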

Enformer is trained to predict functional genomic data including gene expression from 200,000 base pairs of input DNA. The example above features three out of more than 5,000 possible genomic tracks. By using Transformer modules, which gather information across the whole sequence using attention, we are able to effectively consider much longer input sequences than previous models.

To better understand how Enformer interprets the DNA sequence to arrive at more accurate predictions, we used contribution scores to highlight which parts of the input sequence were most influential for the prediction. Matching the biological intuition, we observed that the model paid attention to enhancers even if located more than 50,000 base pairs away from the gene. Predicting which enhancers regulate which genes remains a major unsolved problem in genomics, so we were pleased to see the contribution scores of Enformer perform comparably with existing methods developed specifically for this task (using experimental data as input). Enformer also learned about insulator elements, which separate two independently regulated regions of DNA.
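One common way to compute contribution scores is gradient × input. The sketch below applies it to a hypothetical linear model over one-hot encoded DNA. Whether this matches the scoring used for Enformer is an assumption here, and the weights and sequence are invented. For a linear model the gradient equals the weights, so no automatic differentiation is needed.

```python
BASES = "ACGT"

def one_hot(sequence):
    """Encode a DNA string as one 4-element row per base (order A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence]

def predict(encoded, weights):
    """Hypothetical linear 'expression' model: sum of weight * input."""
    return sum(w * x
               for w_row, x_row in zip(weights, encoded)
               for w, x in zip(w_row, x_row))

def contribution_scores(encoded, weights):
    """Gradient x input, summed over the four bases at each position.

    For this linear model the gradient with respect to the input is the
    weight matrix itself, so the score is just weight * input.
    """
    return [sum(w * x for w, x in zip(w_row, x_row))
            for w_row, x_row in zip(weights, encoded)]

sequence = "ACGTAC"
# Hypothetical weights: position 4 (a stand-in for an enhancer base)
# strongly affects the prediction when it holds an "A".
weights = [[0.0] * 4 for _ in sequence]
weights[4] = [2.0, -1.0, 0.0, 0.0]
weights[1] = [0.0, 0.1, 0.0, 0.0]

scores = contribution_scores(one_hot(sequence), weights)
most_influential = max(range(len(scores)), key=lambda i: abs(scores[i]))
print(scores, most_influential)
```

With a deep network the same recipe would need a framework that can differentiate the prediction with respect to the one-hot input. The toy model only illustrates what the resulting per-position scores mean.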

Enformer attends to relevant regulatory DNA regions (shown in blue) called enhancers (grey boxes) even at distances beyond 20,000 base pairs away from the gene thanks to a more expansive receptive field.

Although it’s now possible to study an organism’s DNA in its entirety, complex experiments are required to understand the genome. Despite an enormous experimental effort, most of how DNA controls gene expression remains a mystery. With AI, we can explore new possibilities for finding patterns in the genome and provide mechanistic hypotheses about sequence changes. Similar to a spell checker, Enformer partially understands the vocabulary of the DNA sequence and can thereby highlight edits that could lead to altered gene expression.

The main application of this new model is to predict which changes to the DNA letters, also called genetic variants, will alter the expression of a gene. Compared to previous models, Enformer is significantly more accurate at predicting the effects of variants on gene expression, for both natural genetic variants and synthetic variants that alter important regulatory sequences. This property is useful for interpreting the growing number of disease-associated variants obtained by genome-wide association studies. Variants associated with complex genetic diseases are predominantly located in the non-coding region of the genome, likely causing disease by altering gene expression. But due to inherent correlations among variants, many of these disease-associated variants are only spuriously correlated rather than causative. Computational tools can now help distinguish the true associations from false positives.
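The basic recipe for scoring a variant can be sketched as the predicted expression of the alternate sequence minus that of the reference sequence. In the sketch below, the model is a hypothetical stand-in that simply counts a made-up motif, and the candidate variants are invented. A real analysis would call a trained model instead.

```python
def toy_expression(sequence, motif="GATA"):
    """Hypothetical stand-in for a trained model: counts motif occurrences."""
    return float(sum(sequence[i:i + len(motif)] == motif
                     for i in range(len(sequence) - len(motif) + 1)))

def apply_variant(reference, position, alt_base):
    """Return the sequence with the base at a 0-based position replaced."""
    return reference[:position] + alt_base + reference[position + 1:]

def variant_effect(reference, position, alt_base, model=toy_expression):
    """Predicted expression of the alternate allele minus the reference."""
    return model(apply_variant(reference, position, alt_base)) - model(reference)

reference = "CCGATACCTTCC"
# Hypothetical correlated variants from one association signal,
# given as (0-based position, alternate base).
candidates = [(3, "C"), (8, "G"), (10, "A")]

# Rank candidates by the size of their predicted effect on expression.
ranked = sorted(candidates,
                key=lambda v: abs(variant_effect(reference, *v)),
                reverse=True)
for position, alt in ranked:
    print(position, alt, variant_effect(reference, position, alt))
```

Only the variant that disrupts the motif receives a non-zero score, which illustrates how a predictive model could help prioritise a likely causal variant among correlated neighbours.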

The variant rs11644125, located in the immune response gene NLRC5, is associated with lower levels of monocyte and lymphocyte white blood cells. By systematically mutating every position surrounding the variant and predicting the resulting change in NLRC5 gene expression (shown as letter height), we observed that the variant leads to an overall lower expression of NLRC5 and modulates the known binding motif of a transcription factor called SP1. Hence, Enformer predictions suggest that the biological mechanism behind this variant’s effect on white blood cell counts is lower NLRC5 gene expression due to perturbed SP1 binding.
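Systematically mutating every position, as described above, can be sketched as a loop over all single-base substitutions. The model here is again a hypothetical motif counter, and the short sequence and GC-rich motif are illustrative assumptions, not the actual NLRC5 locus or Enformer output.

```python
BASES = "ACGT"

def toy_expression(sequence, motif="GGGCGG"):
    """Hypothetical model: expression rises with each motif occurrence."""
    return float(sum(sequence[i:i + len(motif)] == motif
                     for i in range(len(sequence) - len(motif) + 1)))

def saturation_mutagenesis(sequence, model):
    """Score every possible single-base substitution.

    Returns {(position, alt_base): predicted change relative to the input}.
    """
    baseline = model(sequence)
    effects = {}
    for position, ref_base in enumerate(sequence):
        for alt in BASES:
            if alt == ref_base:
                continue  # not a substitution
            mutated = sequence[:position] + alt + sequence[position + 1:]
            effects[(position, alt)] = model(mutated) - baseline
    return effects

sequence = "ATGGGCGGTA"
effects = saturation_mutagenesis(sequence, toy_expression)
# Positions where a substitution lowers predicted expression outline the motif.
sensitive = sorted({pos for (pos, _), delta in effects.items() if delta < 0})
print(sensitive)
```

Plotting the per-position effects as letter heights would give a picture similar in spirit to the one described above, with the sensitive positions tracing out the binding motif.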

We’re far from solving the untold puzzles that remain in the human genome, but Enformer is a step forward in understanding the complexity of genomic sequences. If you’re interested in using AI to explore how fundamental cell processes work, how they’re encoded in the DNA sequence, and how to build new systems to advance genomics and our understanding of disease, we’re hiring. We’re also looking forward to expanding our collaborations with other researchers and organisations eager to explore computational models to help solve the open questions at the heart of genomics.