One approach to reducing the computational complexity of protein structure comparisons is to borrow from sequence comparison tools like BLAST. Sequence comparison tools find the optimal alignment between two or more strings of letters, with each string representing the sequence of a protein and each letter representing an amino acid. This process is much less computationally demanding than the structural alignments performed by tools like TM-align.
Foldseek, developed in 2023, turns three-dimensional structural alignment into a one-dimensional sequence alignment problem. As mentioned above, Foldseek uses a “3Di” alphabet to represent the 3D structure of a protein. In this case, each string represents the structure of a protein and each letter represents the local conformation at one amino acid, described relative to its nearest neighbor in 3D space. Foldseek then uses a sequence comparison algorithm (a modified version of MMseqs2) to align the 3Di sequences. Foldseek is reportedly 88% as sensitive as TM-align but four to five orders of magnitude faster.
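To make the idea concrete, here is a minimal sketch of local alignment scoring (Smith-Waterman) applied to two strings of structural letters. It is not Foldseek's implementation: the example strings are made up, and the match, mismatch, and gap scores are illustrative assumptions standing in for the substitution matrix a real tool would use.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Return the best Smith-Waterman local alignment score of a and b.

    Uses a linear gap penalty and a flat match/mismatch score
    (both are simplifying assumptions for this sketch).
    """
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical structural-alphabet strings, invented for illustration.
query = "DVVLCPPDDA"
target = "AAVVLCPPDQ"
score = local_alignment_score(query, target)
```

Because the comparison works on strings rather than 3D coordinates, each cell of the scoring matrix costs only a few integer operations, which is part of why sequence-style searches scale so well.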
Another modern approach to identifying homologous proteins is the application of Natural Language Processing (NLP), a type of artificial intelligence that processes and extracts meaning from language. Protein sequences are well-suited for NLP because they are analogous to written language in many ways: amino acids are represented by letters, motifs/domains are similar to words, and the entire sequence can be thought of as a sentence. Large language models applied to proteins are referred to as protein language models. One such model, Protein structure-sequence T5 (ProstT5), bridges protein sequence and structure by translating a protein’s amino acid sequence to a 3Di sequence (and vice versa).
Deep learning is another field of artificial intelligence that can be applied to protein structural alignment. For example, the Sequence Alignments from deep-Learning of Structural Alignments (SAdLSA) tool, developed in 2021, is trained on the alignments of tens of thousands of experimentally determined protein structures. After learning to recognize patterns in this data, SAdLSA can predict the structural alignment of proteins based solely on their amino acid sequences.
As mentioned above, applying machine learning tools developed for other fields often requires converting protein structural data into more generic data types, such as graphs. The Graph-based protein Structure Representation (GraSR) tool, developed in 2022, constructs a graph of protein structure based on the coordinates of the alpha carbon in each amino acid. GraSR uses multiple neural networks to learn the geometric features of a protein and can compare protein structures without the need for alignment.
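As a rough illustration of turning coordinates into a graph, the sketch below connects any two residues whose alpha carbons lie within a distance cutoff. This is a generic contact-graph construction, not necessarily the one GraSR uses; the function name, the 8 Å cutoff, and the toy coordinates are all assumptions made for the example.

```python
import math


def contact_graph(ca_coords, cutoff=8.0):
    """Build an adjacency list from alpha-carbon coordinates.

    Residues i and j share an edge when their alpha carbons are at most
    `cutoff` angstroms apart (the 8.0 default is an assumed value).
    """
    n = len(ca_coords)
    graph = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                graph[i].append(j)
                graph[j].append(i)
    return graph


# Toy input: four residues spaced 3.8 angstroms apart along the x-axis.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
graph = contact_graph(coords)
```

A graph like this depends only on distances between residues, so it does not change when the protein is rotated or translated, which is one reason graph representations suit alignment-free comparison.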
Computer vision can also be used to compare protein structures. The Protein Cavity Registration (ProCare) tool, developed in 2020, takes a computer vision-based approach to identify similarities in the shapes of potential ligand binding sites on protein structures. ProCare uses a technique called point cloud registration to align and compare protein cavities in order to identify potential drug targets.
As powerful as these new tools are, they should also be used with caution. For example, tools that predict protein structures are generally trained on biological datasets and may not accurately predict the structures of engineered proteins. Tools that use predicted protein structures assume these structures are correct; this is not always the case.
The approaches discussed here generally compare the entire structures of monomeric proteins. Tools that identify proteins with partial structural similarity (e.g., proteins with shared domains) also have the potential to inform our understanding of protein function. As our ability to predict the structures of protein complexes improves, our need for tools that can compare the structures of protein complexes will also increase.
Just as organisms evolve from pre-existing organisms, proteins tend to evolve from pre-existing proteins. In theory, all (or nearly all) natural proteins should share at least partial homology with other proteins. The ability to identify homologous proteins throughout the tree of life would enable us to identify corresponding molecular pathways in different species, as well as pathways within an organism that have evolved through gene duplication. An improved ability to predict protein function would accelerate the study of diseases and potential treatments. The best method for comparing protein structures is still an area of active research, but the creative application of artificial intelligence methodologies developed for other fields is already yielding promising results.