Computational predictions of protein structures associated with COVID-19

The scientific community has galvanised in response to the recent COVID-19 outbreak, building on decades of basic research characterising this virus family. Labs at the forefront of the outbreak response shared genomes of the virus in open access databases, which enabled researchers to rapidly develop tests for this novel pathogen. Other labs have shared experimentally determined and computationally predicted structures of some of the viral proteins, and still others have shared epidemiological data. We hope to contribute to the scientific effort using the latest version of our AlphaFold system by releasing structure predictions of several understudied proteins associated with SARS-CoV-2, the virus that causes COVID-19. We emphasise that these structure predictions have not been experimentally verified, but hope they may contribute to the scientific community’s interrogation of how the virus functions, and serve as a hypothesis generation platform for future experimental work in developing therapeutics. We’re indebted to the work of many other labs: this work wouldn’t be possible without the efforts of researchers across the globe who have responded to the COVID-19 outbreak with incredible agility.

Knowing a protein’s structure provides an important resource for understanding how it functions, but experiments to determine the structure can take months or longer, and some prove to be intractable. For this reason, researchers have been developing computational methods to predict protein structure from the amino acid sequence. In cases where the structure of a similar protein has already been experimentally determined, algorithms based on “template modelling” are able to provide accurate predictions of the protein structure. AlphaFold, our recently published deep learning system, focuses on predicting protein structure accurately when no structures of similar proteins are available, a setting known as “free modelling”. We’ve continued to improve these methods since that publication and want to provide the most useful predictions, so we’re sharing predicted structures for some of the proteins in SARS-CoV-2 generated using our newly developed methods.

It’s important to note that our structure prediction system is still in development and we can’t be certain of the accuracy of the structures we are providing, although we are confident that the system is more accurate than our earlier CASP13 system. We confirmed that our system provided an accurate prediction for the experimentally determined SARS-CoV-2 spike protein structure shared in the Protein Data Bank, and this gave us confidence that our model predictions on other proteins may be useful. We recently shared our results with several colleagues at the Francis Crick Institute in the UK, including structural biologists and virologists, who encouraged us to release our structures to the general scientific community now. Our models include per-residue confidence scores to help indicate which parts of the structure are more likely to be correct. We have only provided predictions for proteins which lack suitable templates or are otherwise difficult for template modelling. While these understudied proteins are not the main focus of current therapeutic efforts, they may add to researchers’ understanding of SARS-CoV-2.

Normally we’d wait to publish this work until it had been peer-reviewed for an academic journal. However, given the seriousness and time-sensitivity of the situation, we’re releasing the predicted structures as we have them now, under an open license so that anyone can make use of them.  

Interested researchers can read more technical details about these predictions in a document included with the data. The protein structure predictions we’re releasing are for SARS-CoV-2 membrane protein, protein 3a, Nsp2, Nsp4, Nsp6, and Papain-like proteinase (C-terminal domain). To emphasise, these are predicted structures which have not been experimentally verified. We are continuing to work on the system, and we hope to share more about it in due course.

Update (August 4, 2020)

As we continue to improve our AlphaFold system, we’re releasing our most up-to-date predictions of five understudied SARS-CoV-2 targets here: the SARS-CoV-2 membrane protein, Nsp2, Nsp4, Nsp6, and Papain-like proteinase (C-terminal domain).

We’ve previously shared predictions on this website, as well as on the CASP_Commons site, a collaborative effort of members of the CASP (Critical Assessment of Structure Prediction) community. CASP_Commons encourages research groups to share structure predictions for proteins with high biological significance. This spring, they collected predictions for a number of SARS-CoV-2 proteins, and we submitted several models for the five proteins listed above, plus ORF3a.

On June 17, an experimental structure of the ORF3a protein (Protein 3a) from SARS-CoV-2 was deposited in the PDB by members of the Brohawn lab at UC Berkeley. This protein forms an ion channel, and is very challenging for structure prediction due to the small number of related sequences available. It also has a novel fold not previously represented in the PDB. Our primary model for this protein had mostly correct topology but did not place the transmembrane helices or some parts of the extracellular domain correctly. Our second most likely model (pictured below, submitted to CASP_Commons in early April) is in very good agreement with the later experimental work.

We have improved our computational methods since April, and our latest models consistently place the better structure as the most likely prediction. For this reason, we have decided to release a new set of predictions for the remaining five proteins, whose structures have not been experimentally determined.

The experimental paper confirmed several aspects of our model that at first seemed surprising to us (e.g. C133 looked poorly placed to form an inter-chain disulfide, and we found it difficult to see how our prediction would form a C4 tetramer). This bolsters our original hope that it might be possible to draw biologically relevant conclusions from AlphaFold’s blind prediction of even very difficult proteins, and thereby deepen our understanding of understudied biological systems.

The most up-to-date structure predictions, version 3, can be downloaded here (please use these).

You can find version 2 of the predictions, posted on April 8, here.

You can find the original version of the predictions, posted on March 4, here.

Citation: John Jumper, Kathryn Tunyasuvunakool, Pushmeet Kohli, Demis Hassabis, and the AlphaFold Team, “Computational predictions of protein structures associated with COVID-19”, Version 3, DeepMind website, 4 August 2020, https://deepmind.com/research/open-source/computational-predictions-of-protein-structures-associated-with-COVID-19

The best CASP_Commons prediction by AlphaFold is shown in blue, and the experimental structure is shown in green.

04 Aug 2020