Deep neural-network-based language models (LMs) are increasingly applied to large-scale protein sequence data to predict protein function. However, being largely black-box models that are challenging to interpret, current protein LM approaches do not contribute to a fundamental understanding of sequence-function mappings, hindering rule-based biotherapeutic drug development. We argue that guidance drawn from linguistics, a field specialized in the analytical extraction of rules from natural language data, can aid the construction of more interpretable protein LMs that have learned relevant domain-specific rules. Because protein sequence data differ from linguistic sequence data, protein LMs require the integration of more domain-specific knowledge than natural language LMs. Here, we provide a linguistics-based roadmap for protein LM pipeline choices with regard to training data, tokenization, token embedding, sequence embedding, and model interpretation. Combining linguistics with protein LMs enables the development of next-generation interpretable machine learning models with the potential to uncover the biological mechanisms underlying sequence-function relationships.