Žiga Avsec of Google DeepMind: AI allows us to grasp the complexity of the genome
Žiga Avsec is the head of a genomics research team at Google DeepMind, Google's research unit for artificial intelligence development. In an interview with the Slovenian Press Agency he discussed his work at Google DeepMind, genome research and other fields where artificial intelligence systems help foster scientific progress.
What excites you the most in artificial intelligence, what are you eagerly awaiting?
I would divide AI into two types. One resembles human abilities, such as large language models; the other consists of more specialised models. We can expect large language models to help us every day, for example in interpreting literature. What excites me about the more specialised models is that they will let us translate the large amounts of data produced in future experiments into something very useful and help us make big progress in science.
In the team that you lead at DeepMind you are working on understanding the human genome. Which know-how and education are necessary for your work? Is your group very interdisciplinary?
Yes, our team is very interdisciplinary because we have to first understand the biology of the problem, which requires specific knowledge, then we have to understand what kinds of experiments were used to measure processes, how the data is processed, how models learn, and finally how to communicate that to the broader scientific audience.
We are lucky that we can work with a diverse group of experts, from engineers who help us process large amounts of data, to data scientists who analyse the data, biology experts, and machine learning and AI researchers who can train the models well and optimise them for use on accelerators. Our work is very interdisciplinary and that makes it especially interesting.
How large is the group that you lead?
(Laughter) I am not allowed to say that, but I can say that at DeepMind we have projects in different phases, from projects with only two people to those with 20 or more, such as AlphaFold. So we have projects at different levels of maturity and size.
Why are AI systems useful for genome research? What advantages do they have for researchers?
The genetic code is a language that evolution programmed for millions of different animal species through billions of years and it is the language that drives life on Earth. But it was written using the highly random process of evolution, during which mutations and natural selection take place. This means that the code is relatively complex.
You can imagine the genome as a text which contains sentences with combinations of words. It has a meaning but it is difficult to extract it using traditional algorithms. Because of its complexity, the number of combinations and nuances is very high. Artificial intelligence allows us to encompass the complex characteristics of the genome with a model by learning to recognize these nuanced patterns from a large amount of data.
Which techniques are you comparing your work with or 'competing with' in advancing genome research?
I would not really say that we are competing but rather that we are complementing existing techniques. For example, more classical statistical models assume a simpler model ahead of time based on existing knowledge, which makes them easier to interpret. However, when the problem is very complex or our knowledge is incomplete, these statistical models are often too simplistic to capture all the nuance of the biological system. AI algorithms allow us to model more complex phenomena. The price we pay is that these more complex models are also more difficult to interpret.
What would you note as an achievement of your team so far and what are you expecting to see in the future?
We have made quite a lot of progress. Our goal in an abstract sense is to better understand the genome and to make significant progress in science.
One of the models that we developed is called Enformer. It takes a long DNA sequence as input - 200,000 base pairs - and it predicts different characteristics of the sequence, such as which part of the sequence is open in different cell types or which genes are expressed in different cell types. This model then, for example, allows us to study the effects of mutations on these processes. We know that mutations that are in the non-coding part of the genome can, for example, cause a disease by changing the expression of a gene so that it is not expressed.
We hope that models like Enformer will be useful for interpreting mutations and for studying evolution by identifying different regulatory elements in other animal species. Another achievement I would like to point out is our recently developed predictive model AlphaMissense, but more about that later.
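The idea of using a sequence model to study the effect of a mutation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Enformer itself: `toy_expression_model` is a hypothetical stand-in that simply returns the GC fraction of a sequence, and `variant_effect` is a helper name invented for this example. The technique shown is only the general pattern of scoring a variant as the difference between the model's prediction for the mutated sequence and for the reference sequence.

```python
def toy_expression_model(sequence: str) -> float:
    """Hypothetical stand-in for a trained sequence model.

    Returns the fraction of G and C bases; a real model would instead
    predict signals such as chromatin accessibility or gene expression.
    """
    gc_count = sum(1 for base in sequence if base in "GC")
    return gc_count / len(sequence)


def variant_effect(model, reference: str, position: int, alt_base: str) -> float:
    """Score a single-base substitution as model(alternate) - model(reference)."""
    if reference[position] == alt_base:
        raise ValueError("alternate base must differ from the reference base")
    # Build the mutated sequence by swapping one base at the given position.
    alternate = reference[:position] + alt_base + reference[position + 1:]
    return model(alternate) - model(reference)


reference = "ATGCATGCAT"
# Substitute the first base (A) with G and compare the two predictions.
effect = variant_effect(toy_expression_model, reference, 0, "G")
print(f"predicted effect of A>G at position 0: {effect:+.2f}")
```

A positive score here only means that the toy model's output rises after the substitution; with a real model the same comparison would be made per cell type and per predicted signal.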
So you could say that you are developing the technology and using it, applying it at the same time?
Usually we first set a metric of success, meaning what we want the model to do. In machine learning we usually want the model to accurately predict data using information it has not seen before. When planning the project, we need to think about what this metric of success should be and how researchers are going to use it in practical applications. During the project we track the accuracy of predictions. At Google DeepMind we usually strive for major progress and often keep improving the model for a while before leveraging it for downstream problems.
For example, a very large team worked on AlphaFold for years before sharing the predicted structures with the general public and the scientific community, which then used these structures for various applications.
How many and which programming languages are you using in the team?
We are lucky nowadays that the tools are very good and allow us to write code at a higher level. We write a lot of code in Python, which is a comparatively slow language. It is quite some levels away from the code that runs on accelerators. However, compilers allow us to translate this code into code that then runs on accelerators such as graphics cards (GPUs) or tensor processing units (TPUs). So yes, we do most of our work in Python and we use systems such as TensorFlow or JAX that allow us to perform matrix operations using efficient routines compiled for accelerators.
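To make the compilation step concrete, here is a minimal sketch in Python with JAX. It is an illustration only, not code from DeepMind: the function name `dense_layer` and the array shapes are assumptions chosen for the example. `jax.jit` traces the Python function once and compiles it with XLA for whichever backend is available (CPU, GPU or TPU), so the high-level Python code is not what executes on the accelerator.

```python
import jax
import jax.numpy as jnp


@jax.jit  # compile the function with XLA on first call
def dense_layer(x, w, b):
    """One matrix multiplication followed by a tanh non-linearity."""
    return jnp.tanh(x @ w + b)


# Small example inputs; real models use far larger arrays.
x = jnp.ones((4, 3))   # batch of 4 inputs with 3 features each
w = jnp.ones((3, 2))   # weights mapping 3 features to 2 outputs
b = jnp.zeros((2,))    # bias for the 2 outputs

y = dense_layer(x, w, b)
print(y.shape)         # shape of the compiled function's output
print(jax.devices())   # devices JAX found on this machine
```

The first call pays the compilation cost; later calls with the same input shapes reuse the compiled program.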
As you mentioned, Google DeepMind has different projects and teams - how do internal communication and exchange of knowledge and experience work? How much of the knowledge is transferable?
In some projects some technology can be directly used in a different project. One such project was AlphaTensor. In other projects like AlphaFold only certain components of models can be used, such as transformers. The advantage is that we can ask the people that are very familiar with these techniques for tricks on how to best train these models and how to resolve potential issues ... So yes, even if the technology is not directly applicable, some knowledge can still be reused. It is important to note that when technology is applied to something new, such as AlphaFold or genomics, a lot of work is still needed before it works well. People often think that we can just take a technique off the shelf and apply it to a new domain, but it is rarely so simple.
Do you face ethical dilemmas in the course of your work and how do you address them?
We face ethical questions and actively address them because we have to be very responsible. A discussion on such issues is an integral part of the research process.
It is great that we have an excellent ethics team at DeepMind that can contact us, or we can contact them, at different phases of the project: at the start, when we are for example discussing different data options, or at the end, when we are deciding how to share the model with the general public. AlphaFold is a good example of an extended debate on how to share a model with the public. The team brought in external experts from different fields who gave an opinion on whether all protein structure predictions should be open to everyone.
We always try to think about how our work could potentially be used, in both good and bad ways, and whether something can be done to prevent the models from being used in a negative way. So yes, there are ethical questions that we are actively considering and addressing to pioneer responsibly.
You have mentioned AlphaFold several times. We know it as a system that can predict protein structures based on their amino acid sequences. Why is the ability to predict protein structures important?
Proteins are fundamental building blocks of cells that play a very important role. The structure of a protein determines its function, so knowing the structure is very important to structural biologists.
The structure can be experimentally determined with crystallography or cryo-electron microscopy. But the process of determining structures is expensive and very slow; it can take several years. With a model like AlphaFold we can get a similarly accurate structure much faster, in a matter of minutes.
How about translating these findings into, for example, medicine and pharmaceutics?
These structures are very useful, which is evident from the number of users of the AlphaFold database. I believe that more than a million scientists are using it, and a quarter of them are researching diseases. The structures are not directly useful in themselves; we have to see them as a tool or some kind of additional information that can help scientists find the answers they are looking for.
For some problems this is the last missing piece of a mosaic, while others require a lot of experimental work before the mosaic is assembled. People have used these structures in malaria drug development, in researching antibiotic resistance, and in finding genes that are related to diseases... The applications are quite broad.
Does this have the potential to help in the development of medicines that are relatively neglected due to low profitability?
Yes, one of DeepMind's early collaborations was with DNDi, the Drugs for Neglected Diseases initiative. Since such organisations do not have the financial means to determine protein structures in the usual way, AlphaFold allows them to skip that phase and discover treatments for diseases that others are not investing in.
As a group leader, you recently published a paper in the prestigious journal Science on your latest model, AlphaMissense, which is based on AlphaFold. It is a tool that researchers can use to study the effects of genetic mutations that change the amino acid sequence of proteins and their potential to cause disease. What knowledge gaps are you filling with this and why is it important in medicine?
Of the 71 million possible genome mutations that change one amino acid in a protein by a single-letter difference in DNA, only 0.1% have so far been clinically identified and classified as pathogenic or harmless. Predictive models such as AlphaMissense can classify the vast majority of them. For example, AlphaMissense predicted 89% of possible mutations as either potentially pathogenic or harmless with approximately 90% accuracy. These predictions can help clinical geneticists to find the one mutation that caused a rare inherited disease among many other harmless mutations. Discovering this cause may help to choose the right therapy or lead to the development of drugs.
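For a sense of scale, the percentages quoted above can be turned into approximate counts. The short Python calculation below uses only the figures given in the answer (71 million possible mutations, 0.1% clinically classified, 89% classified by the model); the variable names are illustrative, and the results are rounded estimates derived from those figures rather than numbers reported in the interview.

```python
# Figures quoted in the interview.
possible_mutations = 71_000_000   # single-letter DNA changes that alter one amino acid
clinical_per_mille = 1            # 0.1% clinically classified, expressed per thousand
model_percent = 89                # share classified by AlphaMissense

# Integer arithmetic keeps the counts exact.
clinically_classified = possible_mutations * clinical_per_mille // 1000
model_classified = possible_mutations * model_percent // 100

print(f"clinically classified: about {clinically_classified:,}")
print(f"classified by the model: about {model_classified:,}")
print(f"ratio: roughly {model_classified // clinically_classified}x")
```

On these figures, the model assigns a label to several hundred times more mutations than have been classified clinically, which is the gap the answer describes.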
What are the main differences between AlphaMissense and AlphaFold from a technical and methodological point of view? Was the modification or development a challenging process?
We added a couple of extra parameters to the AlphaFold model to capture mutations and added an extra 'output' to predict pathogenicity. The parameters of this model were then slightly adjusted so that the model could still predict the protein structures well, but it also started to better distinguish between common mutations in human or primate populations (these are treated as non-damaging), and mutations that have not yet been measured in human populations (these are treated as pathogenic). This development has been somewhat challenging: on the one hand, we were able to start with an excellent model for predicting structures, which made our job much easier, but on the other hand, many improvements had to be made at various points during the process to make the predictions accurate.
Could you, to round off our conversation, give us a few other examples of how deep learning models are being applied to scientific problems at DeepMind?
We use deep learning in most of the problems we are solving. Artificial intelligence can help mathematicians discover patterns or connections between complex structures that are difficult to spot with the naked eye. The models can also be used in quantum chemistry. Colleagues are using these models to predict the energy of molecules and describe the distribution of electrons within a molecule in order to solve the Schrödinger equation.
Another application is controlling plasma in a fusion reactor by appropriately steering the magnets. Another very interesting application is forecasting the weather a few hours in advance. So yeah, it is quite fascinating to see all the progress in the field of science (laughter).
The interview was originally published on the STA's science portal at znanost.sta.si