Basecamp Research says its software platform can better predict the way proteins behave — not with better algorithms, but with higher-quality data.
In 2018, Google’s AI lab DeepMind released an algorithm that took the biology world by storm. Called AlphaFold, the software was able to accurately predict protein structures, a solution to a complex problem that was heralded as a major scientific breakthrough. Understanding how proteins interact is key to understanding everything in biotech from how to make food taste better to how to make crops survive climate change to curing cancer. Since its release, AlphaFold, its successor AlphaFold2 and the hundreds of millions of protein structures it has generated over the past few years have become a key part of the toolkit of biotech researchers around the world.
But while AlphaFold has helped propel the industry forward, it has its own set of limitations. Researchers are still a long way from the Holy Grail of synthetic biology: an AI model that can take a desired protein shape and figure out how to create it, either by finding the right chemical to interact with it or by designing from scratch a protein found nowhere in nature.
Today, scientists at London-based Basecamp Research announced that they’re a step closer to that goal thanks to a new AI model built on top of AlphaFold2’s open-source algorithms. Basecamp says its model, BaseFold, which is trained on a much broader dataset, can produce more accurate protein structure predictions than AlphaFold2. The company also announced it would be working with Nvidia to optimize BaseFold for use with the chip giant’s generative AI platform for drug discovery, BioNeMo.
Glen Gowers, Basecamp’s cofounder and CEO, claims that the company’s software produces a threefold improvement in predicting how protein structures will change when they interact with small molecules, which is a key data point in the drug discovery process. The company published a paper reporting its results, which have not yet been peer-reviewed, on the preprint server bioRxiv. To date, Basecamp has raised a total of $25 million in capital and has a $71 million valuation, according to PitchBook.
BaseFold is a major milestone for the four-year-old startup, but Gowers, 29, sees it as only one step toward his ultimate goal: being able to design proteins – or even new organisms – to meet his customers’ needs. “We're not looking to be only a protein structure company,” he told Forbes. “We're broadly applying this across any generative or predictive task. So things like protein-function adaptation, generation of new proteins – even generation of new genomes.”
Gowers got the idea for Basecamp in 2019, when he and some fellow researchers spent a month in Iceland living off the grid. They spent their days sequencing the genomes of a special set of microorganisms that had evolved to survive both extreme heat and cold because they lived near both ice and a hot spring. The majority of the data his team gathered in one month was “entirely dark matter of unknown proteins, unknown sequences of unknown origin,” he said. That data helped him realize that the publicly available genomic datasets that AlphaFold has been trained on are “the equivalent of about five drops of water worth of species relative to the Atlantic Ocean’s worth known to exist.”
The sheer volume of data on proteins matters when predicting how these building blocks of life will fold, because so many variables determine how they behave that calculating a fold directly is nearly impossible. But if a machine learning model is trained on billions of different structures, patterns emerge that enable it to predict with better accuracy how a given protein will fold.
Think of it like the AI chatbots that have come on the scene in the past few years. Train a bot on a small subset of human language (say, Twitter) and you will discover, as Microsoft did in 2016, that it becomes a raving lunatic. ChatGPT and its competitors, by contrast, are trained on much bigger and more diverse parts of the internet, resulting in bots that give better answers to questions and are less likely to insult you. In the same way, collecting a much bigger, more diverse set of genomic data makes for better predictions of how proteins will fold.
That’s why Basecamp has been working to diversify the protein dataset that its models are trained on. Since its founding in 2020, the company has partnered with researchers worldwide to sequence high-quality genomic information from tens of millions of microbes, plants and animals. Those researchers, in turn, are paid royalties on the revenue Basecamp generates from their data.
Along with sequencing the DNA of these organisms, the researchers collect contextual information, giving the AI even more data to help explain why proteins fold the way they do. “With every entry in our base, we collect hundreds of extra dimensions,” said the company’s CTO Phillip Lorenz, 31. This includes local temperature, pH, the salinity of the water the organisms were found in, how much light is available to those organisms and more. The places where these samples are collected are also incredibly diverse, he added, from caves in Hungary to deep-sea vents. “We go to all biomes across the world, from volcanic islands to the Antarctic.”
Basecamp is already generating revenue, Gowers told Forbes, by using its predictive modeling to solve customer problems (he declined to share figures). For example, it’s working with U.K.-based Colorifix to design new proteins that can be used to dye fabrics without using harsh chemicals. It’s also helping Connecticut-based startup Protein Evolution to discover new proteins that can break down plastics so they can be recycled. In addition, Gowers hopes to use Basecamp’s computational chops to develop new drugs in collaboration with pharmaceutical companies.
That said, Gowers admits that the company can’t stay scrappy forever. In order to compete with better-capitalized rivals, Basecamp plans to raise more funding in the near future. “Training new models and building new architectures, particularly when your data is extremely large, is an extremely expensive business,” he said.
Sourced from Forbes