Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chengzhong Ye

A Phylogenetic Approach to Genomic Language Modeling

Mar 04, 2025

Carlos Albors, Jianan Canal Li, Gonzalo Benegas, Chengzhong Ye, Yun S. Song

Abstract:Genomic language models (gLMs) have shown mostly modest success in identifying evolutionarily constrained elements in mammalian genomes. To address this issue, we introduce a novel framework for training gLMs that explicitly models nucleotide evolution on phylogenetic trees using multispecies whole-genome alignments. Our approach integrates an alignment into the loss function during training but does not require it for making predictions, thereby enhancing the model's applicability. We applied this framework to train PhyloGPN, a model that excels at predicting functionally disruptive variants from a single sequence alone and demonstrates strong transfer learning capabilities.

* 15 pages, 7 figures

Via

Access Paper or Ask Questions

Genomic Language Models: Opportunities and Challenges

Jul 16, 2024

Gonzalo Benegas, Chengzhong Ye, Carlos Albors, Jianan Canal Li, Yun S. Song

Figure 1 for Genomic Language Models: Opportunities and Challenges

Figure 2 for Genomic Language Models: Opportunities and Challenges

Figure 3 for Genomic Language Models: Opportunities and Challenges

Figure 4 for Genomic Language Models: Opportunities and Challenges

Abstract:Large language models (LLMs) are having transformative impacts across a wide range of scientific fields, particularly in the biomedical sciences. Just as the goal of Natural Language Processing is to understand sequences of words, a major objective in biology is to understand biological sequences. Genomic Language Models (gLMs), which are LLMs trained on DNA sequences, have the potential to significantly advance our understanding of genomes and how DNA elements at various scales interact to give rise to complex functions. In this review, we showcase this potential by highlighting key applications of gLMs, including fitness prediction, sequence design, and transfer learning. Despite notable recent progress, however, developing effective and efficient gLMs presents numerous challenges, especially for species with large, complex genomes. We discuss major considerations for developing and evaluating gLMs.

* Review article; 25 pages, 3 figures, 1 table

Via

Access Paper or Ask Questions