Hongxia Hao

A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

Aug 28, 2025

Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu(+93 more)

Abstract:Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands -- heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.

NatureLM: Deciphering the Language of Nature for Scientific Discovery

Feb 11, 2025

Yingce Xia, Peiran Jin, Shufang Xie, Liang He, Chuan Cao, Renqian Luo, Guoqing Liu, Yue Wang, Zequn Liu, Yuan-Jyue Chen(+35 more)

Abstract:Foundation models have revolutionized natural language processing and artificial intelligence, significantly enhancing how machines comprehend and generate human languages. Inspired by the success of these foundation models, researchers have developed foundation models for individual scientific domains, including small molecules, materials, proteins, DNA, and RNA. However, these models are typically trained in isolation, lacking the ability to integrate across different scientific domains. Recognizing that entities within these domains can all be represented as sequences, which together form the "language of nature", we introduce Nature Language Model (briefly, NatureLM), a sequence-based science foundation model designed for scientific discovery. Pre-trained with data from multiple scientific domains, NatureLM offers a unified, versatile model that enables various applications including: (i) generating and optimizing small molecules, proteins, RNA, and materials using text instructions; (ii) cross-domain generation/design, such as protein-to-molecule and protein-to-RNA generation; and (iii) achieving state-of-the-art performance in tasks like SMILES-to-IUPAC translation and retrosynthesis on USPTO-50k. NatureLM offers a promising generalist approach for various scientific tasks, including drug discovery (hit generation/optimization, ADMET optimization, synthesis), novel material design, and the development of therapeutic proteins or nucleotides. We have developed NatureLM models in different sizes (1 billion, 8 billion, and 46.7 billion parameters) and observed a clear improvement in performance as the model size increases.

* 81 pages

Towards Predicting Equilibrium Distributions for Molecular Systems with Deep Learning

Jun 08, 2023

Shuxin Zheng, Jiyan He, Chang Liu, Yu Shi, Ziheng Lu, Weitao Feng, Fusong Ju, Jiaxi Wang, Jianwei Zhu, Yaosen Min(+8 more)

Abstract:Advances in deep learning have greatly improved structure prediction of molecules. However, many macroscopic observations that are important for real-world applications are not functions of a single molecular structure, but rather determined from the equilibrium distribution of structures. Traditional methods for obtaining these distributions, such as molecular dynamics simulation, are computationally expensive and often intractable. In this paper, we introduce a novel deep learning framework, called Distributional Graphormer (DiG), in an attempt to predict the equilibrium distribution of molecular systems. Inspired by the annealing process in thermodynamics, DiG employs deep neural networks to transform a simple distribution towards the equilibrium distribution, conditioned on a descriptor of a molecular system, such as a chemical graph or a protein sequence. This framework enables efficient generation of diverse conformations and provides estimations of state densities. We demonstrate the performance of DiG on several molecular tasks, including protein conformation sampling, ligand structure sampling, catalyst-adsorbate sampling, and property-guided structure generation. DiG presents a significant advancement in methodology for statistically understanding molecular systems, opening up new research opportunities in molecular science.

* 80 pages, 11 figures

Predicting Lung Nodule Malignancies by Combining Deep Convolutional Neural Network and Handcrafted Features

Sep 07, 2018

Shulong Li, Panpan Xu, Bin Li, Liyuan Chen, Zhiguo Zhou, Hongxia Hao, Yingying Duan, Michael Folkert, Jianhua Ma, Steve Jiang(+1 more)

Abstract:To predict lung nodule malignancy with high sensitivity and specificity, we propose a fusion algorithm that combines handcrafted features (HF) with the features learned at the output layer of a 3D deep convolutional neural network (CNN). First, we extracted twenty-nine handcrafted features, including nine intensity features, eight geometric features, and twelve texture features based on the grey-level co-occurrence matrix (GLCM) averaged over thirteen directions. We then trained 3D CNNs modified from three state-of-the-art 2D CNN architectures (AlexNet, VGG-16 Net and Multi-crop Net) to extract the CNN features learned at the output layer. For each 3D CNN, the CNN features combined with the 29 handcrafted features were used as the input to a support vector machine (SVM) coupled with the sequential forward feature selection (SFS) method to select the optimal feature subset and construct the classifiers. The fusion algorithm takes full advantage of the handcrafted features and the highest-level CNN features learned at the output layer. By adding the intrinsic CNN features, it can overcome the drawback that handcrafted features may not fully reflect the unique characteristics of a particular lesion. Meanwhile, because the handcrafted features are complementary to the CNN features, it also alleviates the CNNs' requirement for a large-scale annotated dataset. The patient cohort includes 431 malignant nodules and 795 benign nodules extracted from the LIDC/IDRI database. For each investigated CNN architecture, the proposed fusion algorithm achieved the highest AUC, accuracy, sensitivity, and specificity scores among all competing classification models.

* 11 pages, 5 figures, 5 tables. This work has been submitted to the IEEE for possible publication
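
The GLCM texture features mentioned in the abstract above can be illustrated with a minimal sketch. It is a simplified 2D version under stated assumptions, not the authors' 3D implementation: the function names (`glcm`, `contrast`, `energy`, `mean_over_directions`), the four 2D offsets, and the choice of contrast and energy as example features are all illustrative, whereas the paper averages twelve features over thirteen 3D directions.

```python
def glcm(image, offset, levels, symmetric=True):
    """Normalised grey-level co-occurrence matrix for one pixel offset.

    image  : list of rows of integer grey levels in [0, levels)
    offset : (row_step, col_step) between the two pixels of a pair
    """
    dr, dc = offset
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                if symmetric:
                    # Count the pair in both orders so the matrix is symmetric.
                    counts[j][i] += 1
    total = sum(map(sum, counts))
    if total == 0:
        raise ValueError("offset produces no pixel pairs inside the image")
    return [[v / total for v in row] for row in counts]


def contrast(p):
    """Sum of p[i][j] * (i - j)^2: large when neighbouring levels differ."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


def energy(p):
    """Sum of squared probabilities: large for uniform textures."""
    return sum(v * v for row in p for v in row)


def mean_over_directions(image, offsets, levels, feature):
    """Average one texture feature over several offsets (directions)."""
    values = [feature(glcm(image, o, levels)) for o in offsets]
    return sum(values) / len(values)


# Hypothetical 4x4 patch with four grey levels, used only for illustration.
patch = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
# Four 2D directions (0, 45, 90, 135 degrees) stand in for the 13 used in 3D.
directions_2d = [(0, 1), (-1, 1), (-1, 0), (-1, -1)]
avg_contrast = mean_over_directions(patch, directions_2d, 4, contrast)
avg_energy = mean_over_directions(patch, directions_2d, 4, energy)
```

Averaging over directions makes the resulting features less sensitive to the orientation of the nodule, which is presumably why the abstract reports direction-averaged values.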

Constructing multi-modality and multi-classifier radiomics predictive models through reliable classifier fusion

Oct 05, 2017

Zhiguo Zhou, Zhi-Jie Zhou, Hongxia Hao, Shulong Li, Xi Chen, You Zhang, Michael Folkert, Jing Wang

Abstract:Radiomics aims to extract and analyze large numbers of quantitative features from medical images and is highly promising in staging, diagnosing, and predicting outcomes of cancer treatments. Nevertheless, several challenges need to be addressed to construct an optimal radiomics predictive model. First, the predictive performance of the model may be reduced when features extracted from an individual imaging modality are blindly combined into a single predictive model. Second, because many different types of classifiers are available to construct a predictive model, selecting an optimal classifier for a particular application is still challenging. In this work, we developed multi-modality and multi-classifier radiomics predictive models that address the aforementioned issues in currently available models. Specifically, a new reliable classifier fusion strategy was proposed to optimally combine output from different modalities and classifiers. In this strategy, modality-specific classifiers were first trained, and an analytic evidential reasoning (ER) rule was developed to fuse the output score from each modality to construct an optimal predictive model. One public data set and two clinical case studies were performed to validate model performance. The experimental results indicated that the proposed ER rule based radiomics models outperformed the traditional models that rely on a single classifier or simply use combined features from different modalities.
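
The idea of fusing modality-specific classifier scores can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the analytic ER rule, so the sketch swaps in the classical Dempster rule of combination with reliability discounting over a two-class frame. The function names, the example scores, and the reliability values are all hypothetical.

```python
def discounted_masses(score, reliability):
    """Turn a classifier score into belief masses (positive, negative, unknown).

    score       : classifier output in [0, 1] for the positive class
    reliability : assumed trust in this classifier, in [0, 1]; the remaining
                  1 - reliability is assigned to 'unknown' (the whole frame)
    """
    if not (0.0 <= score <= 1.0 and 0.0 <= reliability <= 1.0):
        raise ValueError("score and reliability must lie in [0, 1]")
    return (reliability * score, reliability * (1.0 - score), 1.0 - reliability)


def dempster_combine(m1, m2):
    """Combine two (positive, negative, unknown) mass triples with Dempster's rule."""
    p1, n1, t1 = m1
    p2, n2, t2 = m2
    # Conflict: one source supports positive while the other supports negative.
    conflict = p1 * n2 + n1 * p2
    norm = 1.0 - conflict
    if norm <= 0.0:
        raise ValueError("sources are in total conflict and cannot be combined")
    pos = (p1 * p2 + p1 * t2 + t1 * p2) / norm
    neg = (n1 * n2 + n1 * t2 + t1 * n2) / norm
    unknown = (t1 * t2) / norm
    return (pos, neg, unknown)


# Hypothetical scores from two modality-specific classifiers.
modality_a = discounted_masses(score=0.8, reliability=0.9)
modality_b = discounted_masses(score=0.6, reliability=0.5)
fused = dempster_combine(modality_a, modality_b)
```

In this sketch a less reliable modality contributes less to the fused belief, and a modality with reliability 0 leaves the other modality's masses unchanged, which is the behaviour one would want from a score-level fusion of sources with different trustworthiness.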

Polarimetric Hierarchical Semantic Model and Scattering Mechanism Based PolSAR Image Classification

Jul 01, 2015

Fang Liu, Junfei Shi, Licheng Jiao, Hongying Liu, Shuyuan Yang, Jie Wu, Hongxia Hao, Jialing Yuan

Abstract:For polarimetric SAR (PolSAR) image classification, it is a challenge to classify aggregated terrain types, such as urban areas, into semantically homogeneous regions because of sharp bright-dark variations in intensity. An aggregated terrain type is formed by similar ground objects aggregated together. In this paper, a polarimetric hierarchical semantic model (PHSM) is first proposed to overcome this difficulty, based on the construction of primal-level and middle-level semantics. The primal-level semantic is a polarimetric sketch map, which consists of sketch segments that serve as a sparse representation of a PolSAR image. The middle-level semantic is a region map, which extracts semantically homogeneous regions from the sketch map by exploiting the topological structure of the sketch segments. By mapping the region map onto the PolSAR image, a complex PolSAR scene is partitioned into aggregated, structural, and homogeneous pixel-level subspaces, each characterized by relatively coherent terrain types. Then, according to the characteristics of the three subspaces, three specific methods are adopted, and polarimetric information is further exploited to improve the segmentation result. Experimental results on PolSAR data sets with different bands and sensors demonstrate that the proposed method is superior to state-of-the-art methods in region homogeneity and edge preservation for terrain classification.
