Breakthrough Method Enables Rapid Discovery of New Useful Proteins

By Andy Murdock

A new computational approach developed at the IGI lets scientists quickly search massive datasets to prioritize which proteins to study.

AlphaFold2, the ground-breaking AI tool that can accurately predict protein structures, has predicted over 214 million structures to date. For biologists, this is a tantalizing gold mine of data. But it’s also a problem: how do you find the needle you’re looking for in such an immense haystack? Specifically, how can researchers find proteins with useful functions, like new gene editors or therapeutics, hidden among billions of proteins?

In a new paper published in Nature Communications entitled “Functional protein mining with conformal guarantees,” Ron Boger and colleagues unveil a practical new method that allows researchers to quickly narrow down large datasets of proteins to prioritize what to study in the lab. Boger's co-authors include Seyone Chithrananda and Peter Yoon in Jennifer Doudna’s lab at the Innovative Genomics Institute and Anastasios Angelopoulos in Michael Jordan’s lab in UC Berkeley’s Department of Electrical Engineering & Computer Science.

Ron Boger and Seyone Chithrananda working in the Doudna lab at the IGI

Over the past few years, as AI and machine learning have become increasingly integral to life science research, labs have adapted, adding computational resources and bringing in researchers with expertise beyond traditional “wet lab” techniques. Boger, currently a third-year Ph.D. student in the Doudna lab, worked at Google X and in the biotech sector leading machine learning teams before joining the IGI.

“When I joined the lab, scientists had found evolutionary variants of Cas proteins for gene editing and were eager to expand the search to the full range of endonucleases,” says Boger. “I realized my prior machine learning experience could help push beyond traditional homology approaches and discover new endonucleases by using AlphaFold-like strategies.”

He was not alone: the research space started to get crowded with competing tools leveraging protein foundation models for characterizing proteins and understanding their evolutionary histories. But these new models all had limits when it came to practical utility for researchers.

“Something that I kept running into when I was actually trying to use these methods to help discover new proteins was that it was not clear at all what was worthwhile prioritizing and characterizing in the lab,” says Boger. 

In one case, Boger and collaborators wanted to study the function of a set of novel viral genes, so they ran them through a model that had been released just a week earlier and reported state-of-the-art results. The top result for each of the unknown viral genes had a score of 0.9999, almost perfect. But when they looked more carefully at the data, every result had a similarly high score. It was like looking for a restaurant in an unfamiliar city and finding that every restaurant in the city had a 5-star review on Yelp. How would you know which one to pick?

“There wasn’t really a good sense of, okay, what is true and reliable here? Which things are really worthwhile to test?”

When is a squirrel actually a squirrel?

To solve this problem, Boger and colleagues looked to conformal prediction, a statistical approach used in AI and machine learning to provide a measure of uncertainty in a prediction. Co-author Anastasios Angelopoulos in the Jordan lab, a pioneer in conformal prediction methods and theory, met Boger by happenstance at a party where they realized they shared a common interest in solving problems like the one in this paper.

To explain how conformal prediction works, Angelopoulos uses an analogy with UC Berkeley’s favorite campus wildlife: the fox squirrel. If you take a photo of a squirrel and run that photo through a standard AI classifier to identify it, it will provide a single prediction: “fox squirrel.” For a high-quality photo the model might be very confident, and for a low-quality photo it will be less confident, but in either case it just tells you “fox squirrel.” Conformal prediction uses a portion of labeled data held out from training to calibrate how often the model makes mistakes. When mistakes are likely, instead of outputting a single response, the classifier can output a set of possibilities. For example, given a blurry photo of a squirrel, the classifier might output “fox squirrel, ground squirrel, or marmot with 95% confidence.” This gives the user a better understanding of the uncertainty, and the result adapts to the difficulty of the question.

Fox squirrel, ground squirrel, or marmot with 95% confidence. (Photo by Keegan Houser, UC Berkeley, with 100% confidence.)
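
For readers who want to see the squirrel example in concrete terms, here is a minimal Python sketch of the standard "split" form of conformal prediction for a classifier. It is an illustration only, not the code from the paper: the function names, the toy probabilities, and the choice of score (one minus the probability the model gave the true label) are all assumptions made for this example.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    With too few calibration examples the threshold is infinite,
    which means every label is kept in the prediction set.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(cal_scores)[k - 1]

def calibrate(cal_probs, cal_labels, alpha):
    """Compute the threshold from held-out examples with known labels.

    cal_probs: list of dicts mapping label -> model probability (hypothetical format).
    cal_labels: list of the true labels for those examples.
    """
    # Nonconformity score: how little probability the model gave the truth.
    scores = [1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels)]
    return conformal_quantile(scores, alpha)

def prediction_set(probs, qhat):
    """Keep every label whose score does not exceed the calibrated threshold."""
    return sorted(label for label, p in probs.items() if 1.0 - p <= qhat)

if __name__ == "__main__":
    # Toy calibration data (made up for illustration).
    cal_probs = [
        {"fox squirrel": 0.9, "ground squirrel": 0.05, "marmot": 0.05},
        {"fox squirrel": 0.2, "ground squirrel": 0.7, "marmot": 0.1},
        {"fox squirrel": 0.3, "ground squirrel": 0.3, "marmot": 0.4},
        {"fox squirrel": 0.6, "ground squirrel": 0.3, "marmot": 0.1},
    ]
    cal_labels = ["fox squirrel", "ground squirrel", "marmot", "ground squirrel"]
    qhat = calibrate(cal_probs, cal_labels, alpha=0.25)

    blurry_photo = {"fox squirrel": 0.45, "ground squirrel": 0.35, "marmot": 0.20}
    print(prediction_set(blurry_photo, qhat))
```

A sharp photo, where one label dominates, tends to yield a one-item set, while a blurry photo yields several candidates. The coverage guarantee assumes the calibration examples and the new photo are exchangeable, that is, drawn from the same kind of data.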

The team’s insight was to apply this approach to the problem of discovering proteins. Traditional structural alignment methods are not computationally feasible with massive data sets, so this seemed like the perfect situation to use a machine learning model. To make it truly useful for researchers who need to choose proteins to study in the lab, Boger and team let researchers set how much uncertainty is acceptable given their experimental bandwidth.

“This paper is about a new method that helps biologists choose what to prioritize for study in the lab and have a statistical measure of certainty on protein function,” explains Boger.
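
To make the idea of a user-set uncertainty level concrete, here is a simplified Python sketch of how a calibrated cutoff on similarity scores could be chosen. It is a stand-in for illustration, not the procedure from the paper: the function names, the toy scores, and the specific guarantee (a new true match falls below the cutoff with probability at most alpha, assuming it is exchangeable with the calibration matches) are assumptions of this example.

```python
import math

def recall_threshold(true_match_scores, alpha):
    """Pick a score cutoff from calibration pairs known to be true matches.

    Returns the floor(alpha * (n + 1))-th smallest calibration score, so a
    new true match scores below the cutoff with probability at most alpha.
    With too few calibration pairs the cutoff is -inf (nothing is filtered).
    """
    n = len(true_match_scores)
    k = math.floor(alpha * (n + 1))
    if k < 1:
        return float("-inf")
    return sorted(true_match_scores)[k - 1]

def prefilter(hits, threshold):
    """Keep only the (name, score) hits at or above the calibrated cutoff."""
    return [(name, score) for name, score in hits if score >= threshold]

if __name__ == "__main__":
    # Similarity scores the model gave to known true matches (made-up numbers).
    calibration = [0.91, 0.88, 0.97, 0.72, 0.85, 0.93, 0.79]
    cutoff = recall_threshold(calibration, alpha=0.25)

    # Hypothetical search results for one query protein.
    hits = [("protein_A", 0.95), ("protein_B", 0.60), ("protein_C", 0.81)]
    print(prefilter(hits, cutoff))
```

A lab with limited bench capacity could pick a larger alpha to get a stricter cutoff and a shorter list, accepting that more true matches will be missed; a smaller alpha keeps more candidates.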

What genes are necessary for life?

In 2016, J. Craig Venter and a team of researchers engineered a bacterium with the minimum viable genome, i.e., just the genes absolutely necessary for life. Despite its small genome, researchers didn’t actually know what function nearly 20 percent of the genes had. Earlier alignment methods couldn’t find any likely matches.

“With our method, we were able to label 40 percent of those unknown genes with a high degree of statistical certainty,” says Boger. “We had known these genes were there and that they were vital, but now we know their function. We can now annotate genes of previously unknown function using AI models with statistical certainty.”

Co-authors Peter Yoon, Seyone Chithrananda, and Ron Boger (left to right)

In addition to this test case, the team used their new approach to improve a leading model’s ability to predict the functions of uncharacterized enzymes, and they also showed how their method can be used to quickly pre-filter huge databases to save computing time. This latter finding could have been a big help on another recent paper from the Doudna lab that Boger co-authored, which discovered a new CRISPR-associated protein using a combination of an AI model and long computational crunching on a UCSF supercomputer. 

“With this method, we would’ve been able to do those searches in two seconds without the need of a supercomputer,” says Boger, showing just how quickly the field is moving.

Next, the research team plans to use this method to start characterizing more genes of unknown function, and to search for new CRISPR-associated gene-editing proteins and other proteins with desirable properties that had previously been needles lost in an impossibly large haystack.


Read more: Boger RS, Chithrananda S, Angelopoulos AN, Yoon PH, Jordan MI, and Doudna JA. Functional protein mining with conformal guarantees. Nature Communications. 2025 Jan 2;16(1):85. https://doi.org/10.1038/s41467-024-55676-y

Media contact: Andy Murdock, andymurdock@berkeley.edu

Andy Murdock is a science writer, evolutionary biologist, and Communications Director for the Innovative Genomics Institute. Before joining the IGI, Andy managed research communications for UC Office of the President, edited journals for Informa Life Sciences, and worked in the travel industry as Managing Editor for Airbnb and Digital Editor for Lonely Planet. Andy’s writing has appeared in Vox, BBC, Discovery, the Washington Post, the San Francisco Chronicle, and more. Andy has a Ph.D. in Integrative Biology from UC Berkeley, where he focused on green plant phylogenetics, ancient fern lineages, and the evolution of plant genomes.