Title: Does your model understand genes? A benchmark of gene properties for biological and text models

Dec 05, 2024

Yoav Kan-Tor, Michael Morris Danziger, Eden Zohar, Matan Ninio, Yishai Shimoni

Abstract: The application of deep learning methods, particularly foundation models, in biological research has surged in recent years. These models can be text-based or trained on underlying biological data, especially omics data of various types. However, comparing the performance of these models consistently has proven to be a challenge due to differences in training data and downstream tasks. To tackle this problem, we developed an architecture-agnostic benchmarking approach that, instead of evaluating the models directly, leverages entity representation vectors from each model and trains simple predictive models for each benchmarking task. This ensures that all types of models are evaluated using the same input and output types. Here we focus on gene properties collected from professionally curated bioinformatics databases. These gene properties are categorized into five major groups: genomic properties, regulatory functions, localization, biological processes, and protein properties. Overall, we define hundreds of tasks based on these databases, which include binary, multi-label, and multi-class classification tasks. We apply these benchmark tasks to evaluate expression-based models, large language models, protein language models, DNA-based models, and traditional baselines. Our findings suggest that text-based models and protein language models generally outperform expression-based models in genomic properties and regulatory functions tasks, whereas expression-based models demonstrate superior performance in localization tasks. These results should aid in the development of more informed artificial intelligence strategies for biological understanding and therapeutic discovery. To ensure the reproducibility and transparency of our findings, we have made the source code and benchmark data publicly accessible for further investigation and expansion at github.com/BiomedSciAI/gene-benchmark.
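The core idea in the abstract, taking one fixed representation vector per gene from each model and fitting the same simple predictor on every task, can be illustrated with a small sketch. This is not the authors' implementation or the API of the gene-benchmark repository. Everything here is an assumption made for illustration: the gene names, labels, and embeddings are synthetic, the nearest-centroid classifier is a stand-in for whatever simple predictive model the benchmark actually trains, and `fit_centroids`, `predict`, `cross_val_accuracy`, and `make_toy_embeddings` are hypothetical helper names.

```python
import random
from statistics import mean


def fit_centroids(vectors, labels):
    """Average the training vectors of each class into one centroid per class."""
    grouped = {}
    for vec, label in zip(vectors, labels):
        grouped.setdefault(label, []).append(vec)
    return {
        label: [mean(column) for column in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids, vec):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], vec))


def cross_val_accuracy(embeddings, labels, n_folds=5, seed=0):
    """Score one model on one task.

    embeddings: dict mapping gene symbol -> vector (the only thing taken from the model)
    labels:     dict mapping gene symbol -> class label (the task definition)
    """
    genes = sorted(set(embeddings) & set(labels))  # genes known to both sides
    random.Random(seed).shuffle(genes)
    folds = [genes[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, test_genes in enumerate(folds):
        train_genes = [g for j, fold in enumerate(folds) if j != i for g in fold]
        centroids = fit_centroids(
            [embeddings[g] for g in train_genes],
            [labels[g] for g in train_genes],
        )
        hits = sum(predict(centroids, embeddings[g]) == labels[g] for g in test_genes)
        scores.append(hits / len(test_genes))
    return mean(scores)


def make_toy_embeddings(genes, labels, informative, dim=8, seed=1):
    """Synthetic stand-in for a model's gene vectors (purely illustrative)."""
    rng = random.Random(seed)
    return {
        g: [rng.gauss(3.0 * labels[g] if informative else 0.0, 1.0) for _ in range(dim)]
        for g in genes
    }


# Hypothetical binary task over 100 made-up gene identifiers.
genes = [f"GENE{i}" for i in range(100)]
labels = {g: i % 2 for i, g in enumerate(genes)}

# Two hypothetical "models": one whose vectors encode the property, one that does not.
models = {
    "informative_model": make_toy_embeddings(genes, labels, informative=True),
    "random_model": make_toy_embeddings(genes, labels, informative=False),
}

# Same predictor, same folds, same metric for every model.
scores = {name: cross_val_accuracy(emb, labels) for name, emb in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Because the predictor only ever sees a gene-to-vector mapping, the same evaluation loop can in principle be applied to any model type listed in the abstract, which is what makes the comparison architecture-agnostic.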
