DeepMind has given us the 3-D structures of all human proteins. What does it mean for biofacturing?
Here’s the story of a true triumph in tech and bio.
DeepMind — a British AI research lab owned by Google parent company Alphabet — has created machine learning software called AlphaFold that solves a 50-year-old problem in biology: predicting the 3-D structure of a protein based on its sequence. Last week, it open-sourced its software and shared with the world a database of 350,000 predicted structures for the most useful proteins in science, including all human proteins.
In biology, there’s an old axiom that “structure is function”: a protein’s shape is the first important indicator of what it actually does. For decades, researchers have used slow, costly lab approaches like X-ray crystallography to see a protein’s shape. Now, researchers can pretty much “google” it. Some structures that have eluded researchers for years are now at their fingertips, opening new doors in medicine, bioengineering, and research. The implications for human disease alone, including COVID-19, are huge.
But you don’t have to be in the pharma business to be excited about the DeepMind news. In addition to human proteins, DeepMind also shared the protein structures of 20 other model organisms, including workhorse microbes like E. coli and yeast. Biofacturing companies like Zymergen use such microbes to make products with biology. The DeepMind data on them is expected to help create microbes that better produce breakthrough chemicals and materials.
“The DeepMind library could help the industry take a big step forward,” says Zymergen CTO Aaron Kimball, “whether that’s to make new biosynthetic pathways for chemicals and materials, or to make novel proteins for human therapeutic applications.”
Machine learning is making its mark — sometimes in strange places
AlphaFold predicts protein structures using a neural network, a mathematical system that can learn tasks by analyzing vast amounts of data. Neural networks are great at applications that depend on sequential data, and many of us use them every day for things like digital voice commands, image searching, and natural language translation.
Given the strength of neural networks at tackling language, it might not come as a surprise that they are also good at working with biological sequence data. Researchers gave AlphaFold a training set of thousands of lab-proven protein structures, along with their amino acid sequences. Using this data, the machine developed a set of rules that describe how sequence maps to structure. The researchers then asked AlphaFold to use those rules to predict the 3-D structures of many other proteins based on their amino acid sequences alone.
Every other year, the who’s who of the protein folding community comes together for CASP (Critical Assessment of Structure Prediction), which lets researchers compare their algorithms in an apples-to-apples environment over the course of a few months. At the 2020 CASP, AlphaFold was head-and-shoulders above the competition on just about every protein. In fact, it predicted structures that were in many cases hard to distinguish from the gold-standard structures determined using painstaking, low-throughput lab methods. The results were so stupendous that they left some experts wondering: is it even worth doing CASP again?
[Figure: How protein structure prediction has improved since AlphaFold came along. A score of around 90 GDT is generally considered to be comparable with results obtained from experimental methods. Credit: DeepMind]
[Figure: Two examples of protein targets that AlphaFold predicted with high accuracy compared to lab experimental results. Credit: DeepMind]
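To make the GDT score in the chart caption concrete, here is a minimal Python sketch of how a GDT-style score can be computed. GDT (Global Distance Test) asks what percentage of a protein's residues land within a given distance of their experimentally determined positions, averaged over cutoffs of 1, 2, 4, and 8 angstroms. The function name `gdt_ts` and the toy coordinates are illustrative assumptions, and the sketch assumes the two structures have already been superimposed. The real CASP evaluation also searches over superpositions, which is left out here.

```python
import math

def gdt_ts(predicted, experimental, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-style score on a 0-100 scale.

    Assumes both inputs are equal-length lists of (x, y, z) coordinates in
    angstroms, one per residue, and that the structures are ALREADY
    superimposed. The full CASP procedure also optimizes the superposition.
    """
    if not predicted or len(predicted) != len(experimental):
        raise ValueError("need two non-empty coordinate lists of equal length")

    # Distance between each predicted residue and its experimental position.
    distances = [math.dist(p, q) for p, q in zip(predicted, experimental)]

    # Fraction of residues within each cutoff, then averaged over the cutoffs.
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy four-residue example (made-up coordinates, not real protein data).
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 10.0, 0.0)]

print(gdt_ts(predicted, experimental))  # 56.25
```

In this toy case one residue is off by 10 angstroms and misses every cutoff, which drags the score down to 56.25. A prediction scoring around 90 has nearly every residue sitting within an angstrom or two of where the lab experiment says it should be.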
“Just five years ago, many people thought they wouldn’t see a solution to the protein folding problem in their lifetimes,” says Kurt Thorn, Senior Director of Data Science at Zymergen. He thinks that AlphaFold is the most visible and perhaps most helpful demonstration of what machine learning can do for biology right now, and yet it’s just the beginning.
“When AlphaFold debuted in 2018,” Thorn says, “it wasn’t as good as it is now. But it caused a whole community to sit up and take notice and say, ‘This is a game-changer.’ It was achieving structure prediction that nobody had ever been able to achieve, and it was doing it in a totally different way than anybody else had ever done it.”
Thorn says AlphaFold jump-started an entire field of work, not just in academia but also in places like Facebook, Amazon, Apple, and Twitter. He says that even Salesforce is applying AI to biology, having published work on its own machine learning tool, ProGen, for designing novel enzymes from scratch. AlphaFold, ProGen, and several other examples all rely on deep learning neural networks, although they explore a range of internal model architectures. Some of these are inspired by natural language approaches, where amino acids are like words and a protein is like the paragraph they form.
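The language analogy can be sketched in a few lines of Python. A language model first turns words into integer tokens, and a protein model can do the same with the 20 standard amino acids. Everything named here (`AMINO_ACIDS`, `tokenize`, `detokenize`, the choice to reserve 0 for unknown residues) is a hypothetical illustration, not the actual tokenizer used by AlphaFold or ProGen.

```python
# The 20 standard amino acids, written as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical vocabulary: ids 1-20 for amino acids, 0 reserved for anything else.
UNKNOWN_ID = 0
TOKEN_TO_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}
ID_TO_TOKEN = {i: aa for aa, i in TOKEN_TO_ID.items()}


def tokenize(sequence):
    """Turn a protein sequence into a list of integer ids, one per residue."""
    return [TOKEN_TO_ID.get(residue, UNKNOWN_ID) for residue in sequence.upper()]


def detokenize(ids):
    """Turn integer ids back into one-letter codes ('X' for unknown)."""
    return "".join(ID_TO_TOKEN.get(i, "X") for i in ids)


# A short made-up peptide, treated like a three-word sentence.
ids = tokenize("MKV")
print(ids)              # [11, 9, 18]
print(detokenize(ids))  # MKV
```

Once a protein is a list of integers, it can be fed to the same kinds of sequence models that process sentences, which is why ideas from natural language processing carry over so readily.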
Thorn says that what could end up being a game-changer for companies like Zymergen may not be the AlphaFold tool per se, but some other new advancement it enables, such as helping researchers predict not just structure but the actual function of the protein.
What comes next for machine learning and biology
Thorn and Kimball point out that, until now, machine learning has focused on properties like protein structure, for which large open-source datasets are readily available. But really, it’s the function of the protein that matters most.
Thorn explains: “When we’re designing an enzymatic pathway for a new product, we don’t say, ‘We need an enzyme that looks just like this.’ We say, ‘We need an enzyme that makes this product efficiently.’ Historically, structure has been a good way to learn about function, but we care a lot more about a protein’s actual function than its structure.”
“Functional prediction remains elusive at any scale,” adds Kimball. He and others at Zymergen are eager to push the machine learning envelope in this area. They want to be able to “google” not only the predicted function of a given protein but also what proteins carry out a given function and at what efficiency level. Zymergen’s own metagenomic database is an obvious place to apply such tools.
“Zymergen’s metagenomic database is a valuable source of training data and further understanding,” says Kimball. “Through our own work and strategic acquisitions of enEvolv and Lodo Therapeutics, Zymergen has created an incredible metagenomic database, and by applying structure prediction and other tools to it, we will illuminate new functional possibilities that Nature has created over billions of years.”
“Technology developed by groups like DeepMind has the potential to create a new era in natural product research,” says Brad Hover, R&D Director for Zymergen Exploration & Discovery. “Enzyme structural predictions not only help us understand how metagenomically-derived natural products are being made but also guide us in finding their microbial or human protein targets.”
A machine learning playground
For aspiring data scientists, biologists, and others, it’s a great time to be making things with biology. Lab automation, cloud computing, and machine learning are growing in power and affordability. Increasing data volumes are being met with better tools for predicting and harnessing gene function. And entirely new professions are being forged as biologists now work closely with engineers, software developers, and data scientists to master this technological fusion.
At places like Zymergen, it’s the perfect opportunity for like-minded makers and thinkers to put the best data and tools together and see what we can do at the edge of what’s possible.
“We are at the beginning of a revolution in biology,” says Kimball.