Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xuhong Li

From Macro to Micro: Probing Dataset Diversity in Language Model Fine-Tuning

May 30, 2025

Haoyu Li, Xuhong Li, Yiming Dong, Kun Liu

Abstract:Dataset diversity plays a pivotal role for the successful training of many machine learning models, particularly in the supervised fine-tuning (SFT) stage of large language model (LLM) development. Despite increasing recognition of its importance, systematic analyses of dataset diversity still remain underexplored. To address this gap, this work presents a systematic taxonomy of existing diversity-control strategies, which primarily focus on the instruction component, operating at either macroscopic (entire instruction semantics) or mesoscopic levels (instruction units), and furthermore introduces a novel analysis of microscopic diversity within the response component, specifically analyzing the statistical distribution of tokens in SFT training samples. In the experimental evaluation, we construct fixed-size datasets (e.g., 10,000 samples each) from a corpus of 117,000 open-source SFT samples, incorporating six distinct diversity-control strategies spanning macro-, meso-, and microscopic levels applied to both instructions and responses. We then fine-tune LLMs on these datasets to assess the six diversity-control strategies. Results reveal that while macroscopic and mesoscopic strategies lead to higher performance with increasing diversity, the microscopic strategy in responses exhibits both a stronger correlation between model performance and the degree of diversity and superior performance with maximum diversity across all strategies. These findings offer actionable insights for constructing high-performance SFT datasets.

Via

Access Paper or Ask Questions

SOLA-GCL: Subgraph-Oriented Learnable Augmentation Method for Graph Contrastive Learning

Mar 13, 2025

Tianhao Peng, Xuhong Li, Haitao Yuan, Yuchen Li, Haoyi Xiong

Abstract:Graph contrastive learning has emerged as a powerful technique for learning graph representations that are robust and discriminative. However, traditional approaches often neglect the critical role of subgraph structures, particularly the intra-subgraph characteristics and inter-subgraph relationships, which are crucial for generating informative and diverse contrastive pairs. These subgraph features are crucial as they vary significantly across different graph types, such as social networks where they represent communities, and biochemical networks where they symbolize molecular interactions. To address this issue, our work proposes a novel subgraph-oriented learnable augmentation method for graph contrastive learning, termed SOLA-GCL, that centers around subgraphs, taking full advantage of the subgraph information for data augmentation. Specifically, SOLA-GCL initially partitions a graph into multiple densely connected subgraphs based on their intrinsic properties. To preserve and enhance the unique characteristics inherent to subgraphs, a graph view generator optimizes augmentation strategies for each subgraph, thereby generating tailored views for graph contrastive learning. This generator uses a combination of intra-subgraph and inter-subgraph augmentation strategies, including node dropping, feature masking, intra-edge perturbation, inter-edge perturbation, and subgraph swapping. Extensive experiments have been conducted on various graph learning applications, ranging from social networks to molecules, under semi-supervised learning, unsupervised learning, and transfer learning settings to demonstrate the superiority of our proposed approach over the state-of-the-art in GCL.

Via

Access Paper or Ask Questions

Pre-trained Molecular Language Models with Random Functional Group Masking

Nov 03, 2024

Tianhao Peng, Yuchen Li, Xuhong Li, Jiang Bian, Zeke Xie, Ning Sui, Shahid Mumtaz, Yanwu Xu, Linghe Kong, Haoyi Xiong

Figure 1 for Pre-trained Molecular Language Models with Random Functional Group Masking

Figure 2 for Pre-trained Molecular Language Models with Random Functional Group Masking

Figure 3 for Pre-trained Molecular Language Models with Random Functional Group Masking

Figure 4 for Pre-trained Molecular Language Models with Random Functional Group Masking

Abstract:Recent advancements in computational chemistry have leveraged the power of trans-former-based language models, such as MoLFormer, pre-trained using a vast amount of simplified molecular-input line-entry system (SMILES) sequences, to understand and predict molecular properties and activities, a critical step in fields like drug discovery and materials science. To further improve performance, researchers have introduced graph neural networks with graph-based molecular representations, such as GEM, incorporating the topology, geometry, 2D or even 3D structures of molecules into pre-training. While most of molecular graphs in existing studies were automatically converted from SMILES sequences, it is to assume that transformer-based language models might be able to implicitly learn structure-aware representations from SMILES sequences. In this paper, we propose \ours{} -- a SMILES-based \underline{\em M}olecular \underline{\em L}anguage \underline{\em M}odel, which randomly masking SMILES subsequences corresponding to specific molecular \underline{\em F}unctional \underline{\em G}roups to incorporate structure information of atoms during the pre-training phase. This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities. Extensive experimental evaluations across 11 benchmark classification and regression tasks in the chemical domain demonstrate the robustness and superiority of \ours{}. Our findings reveal that \ours{} outperforms existing pre-training models, either based on SMILES or graphs, in 9 out of the 11 downstream tasks, ranking as a close second in the remaining ones.

* Under review

Via

Access Paper or Ask Questions

Towards Automated Data Sciences with Natural Language and SageCopilot: Practices and Lessons Learned

Jul 21, 2024

Yuan Liao, Jiang Bian, Yuhui Yun, Shuo Wang, Yubo Zhang, Jiaming Chu, Tao Wang, Kewei Li, Yuchen Li, Xuhong Li(+2 more)

Figure 1 for Towards Automated Data Sciences with Natural Language and SageCopilot: Practices and Lessons Learned

Figure 2 for Towards Automated Data Sciences with Natural Language and SageCopilot: Practices and Lessons Learned

Figure 3 for Towards Automated Data Sciences with Natural Language and SageCopilot: Practices and Lessons Learned

Figure 4 for Towards Automated Data Sciences with Natural Language and SageCopilot: Practices and Lessons Learned

Abstract:While the field of NL2SQL has made significant advancements in translating natural language instructions into executable SQL scripts for data querying and processing, achieving full automation within the broader data science pipeline - encompassing data querying, analysis, visualization, and reporting - remains a complex challenge. This study introduces SageCopilot, an advanced, industry-grade system system that automates the data science pipeline by integrating Large Language Models (LLMs), Autonomous Agents (AutoAgents), and Language User Interfaces (LUIs). Specifically, SageCopilot incorporates a two-phase design: an online component refining users' inputs into executable scripts through In-Context Learning (ICL) and running the scripts for results reporting & visualization, and an offline preparing demonstrations requested by ICL in the online phase. A list of trending strategies such as Chain-of-Thought and prompt-tuning have been used to augment SageCopilot for enhanced performance. Through rigorous testing and comparative analysis against prompt-based solutions, SageCopilot has been empirically validated to achieve superior end-to-end performance in generating or executing scripts and offering results with visualization, backed by real-world datasets. Our in-depth ablation studies highlight the individual contributions of various components and strategies used by SageCopilot to the end-to-end correctness for data sciences.

Via

Access Paper or Ask Questions

Converging Paradigms: The Synergy of Symbolic and Connectionist AI in LLM-Empowered Autonomous Agents

Jul 11, 2024

Haoyi Xiong, Zhiyuan Wang, Xuhong Li, Jiang Bian, Zeke Xie, Shahid Mumtaz, Laura E. Barnes

Figure 1 for Converging Paradigms: The Synergy of Symbolic and Connectionist AI in LLM-Empowered Autonomous Agents

Figure 2 for Converging Paradigms: The Synergy of Symbolic and Connectionist AI in LLM-Empowered Autonomous Agents

Figure 3 for Converging Paradigms: The Synergy of Symbolic and Connectionist AI in LLM-Empowered Autonomous Agents

Figure 4 for Converging Paradigms: The Synergy of Symbolic and Connectionist AI in LLM-Empowered Autonomous Agents

Abstract:This article explores the convergence of connectionist and symbolic artificial intelligence (AI), from historical debates to contemporary advancements. Traditionally considered distinct paradigms, connectionist AI focuses on neural networks, while symbolic AI emphasizes symbolic representation and logic. Recent advancements in large language models (LLMs), exemplified by ChatGPT and GPT-4, highlight the potential of connectionist architectures in handling human language as a form of symbols. The study argues that LLM-empowered Autonomous Agents (LAAs) embody this paradigm convergence. By utilizing LLMs for text-based knowledge modeling and representation, LAAs integrate neuro-symbolic AI principles, showcasing enhanced reasoning and decision-making capabilities. Comparing LAAs with Knowledge Graphs within the neuro-symbolic AI theme highlights the unique strengths of LAAs in mimicking human-like reasoning processes, scaling effectively with large datasets, and leveraging in-context samples without explicit re-training. The research underscores promising avenues in neuro-vector-symbolic integration, instructional encoding, and implicit reasoning, aimed at further enhancing LAA capabilities. By exploring the progression of neuro-symbolic AI and proposing future research trajectories, this work advances the understanding and development of AI technologies.

Via

Access Paper or Ask Questions

When Search Engine Services meet Large Language Models: Visions and Challenges

Jun 28, 2024

Haoyi Xiong, Jiang Bian, Yuchen Li, Xuhong Li, Mengnan Du, Shuaiqiang Wang, Dawei Yin, Sumi Helal

Abstract:Combining Large Language Models (LLMs) with search engine services marks a significant shift in the field of services computing, opening up new possibilities to enhance how we search for and retrieve information, understand content, and interact with internet services. This paper conducts an in-depth examination of how integrating LLMs with search engines can mutually benefit both technologies. We focus on two main areas: using search engines to improve LLMs (Search4LLM) and enhancing search engine functions using LLMs (LLM4Search). For Search4LLM, we investigate how search engines can provide diverse high-quality datasets for pre-training of LLMs, how they can use the most relevant documents to help LLMs learn to answer queries more accurately, how training LLMs with Learning-To-Rank (LTR) tasks can enhance their ability to respond with greater precision, and how incorporating recent search results can make LLM-generated content more accurate and current. In terms of LLM4Search, we examine how LLMs can be used to summarize content for better indexing by search engines, improve query outcomes through optimization, enhance the ranking of search results by analyzing document relevance, and help in annotating data for learning-to-rank tasks in various learning contexts. However, this promising integration comes with its challenges, which include addressing potential biases and ethical issues in training models, managing the computational and other costs of incorporating LLMs into search services, and continuously updating LLM training with the ever-changing web content. We discuss these challenges and chart out required research directions to address them. We also discuss broader implications for service computing, such as scalability, privacy concerns, and the need to adapt search engine architectures for these advanced models.

* Under Review

Via

Access Paper or Ask Questions

Tokenization Falling Short: The Curse of Tokenization

Jun 17, 2024

Yekun Chai, Yewei Fang, Qiwei Peng, Xuhong Li

Abstract:Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process inherently sensitive to typographical errors, length variations, and largely oblivious to the internal structure of tokens-issues we term the curse of tokenization. In this study, we delve into these drawbacks and demonstrate that large language models (LLMs) remain susceptible to these problems. This study systematically investigates these challenges and their impact on LLMs through three critical research questions: (1) complex problem solving, (2) token structure probing, and (3) resilience to typographical variation. Our findings reveal that scaling model parameters can mitigate the issue of tokenization; however, LLMs still suffer from biases induced by typos and other text format variations. Our experiments show that subword regularization such as BPE-dropout can mitigate this issue. We will release our code and data to facilitate further research.

Via

Access Paper or Ask Questions

Interpretable Machine Learning for Weather and Climate Prediction: A Survey

Mar 24, 2024

Ruyi Yang, Jingyu Hu, Zihao Li, Jianli Mu, Tingzhao Yu, Jiangjiang Xia, Xuhong Li, Aritra Dasgupta, Haoyi Xiong

Figure 1 for Interpretable Machine Learning for Weather and Climate Prediction: A Survey

Figure 2 for Interpretable Machine Learning for Weather and Climate Prediction: A Survey

Figure 3 for Interpretable Machine Learning for Weather and Climate Prediction: A Survey

Figure 4 for Interpretable Machine Learning for Weather and Climate Prediction: A Survey

Abstract:Advanced machine learning models have recently achieved high predictive accuracy for weather and climate prediction. However, these complex models often lack inherent transparency and interpretability, acting as "black boxes" that impede user trust and hinder further model improvements. As such, interpretable machine learning techniques have become crucial in enhancing the credibility and utility of weather and climate modeling. In this survey, we review current interpretable machine learning approaches applied to meteorological predictions. We categorize methods into two major paradigms: 1) Post-hoc interpretability techniques that explain pre-trained models, such as perturbation-based, game theory based, and gradient-based attribution methods. 2) Designing inherently interpretable models from scratch using architectures like tree ensembles and explainable neural networks. We summarize how each technique provides insights into the predictions, uncovering novel meteorological relationships captured by machine learning. Lastly, we discuss research challenges around achieving deeper mechanistic interpretations aligned with physical principles, developing standardized evaluation benchmarks, integrating interpretability into iterative model development workflows, and providing explainability for large foundation models.

* 26 pages, 5 figures

Via

Access Paper or Ask Questions

A Wideband Distributed Massive MIMO Channel Sounder for Communication and Sensing

Mar 18, 2024

Michiel Sandra, Christian Nelson, Xuhong Li, Xuesong Cai, Fredrik Tufvesson, Anders J Johansson

Abstract:Channel sounding is a vital step in understanding wireless channels for the design and deployment of wireless communication systems. In this paper, we present the design and implementation of a coherent distributed massive MIMO channel sounder operating at 5-6 GHz with a bandwidth of 400 MHz based on the NI USRP X410. Through the integration of transceiver chains and RF switches, the design facilitates the use of a larger number of antennas without significant compromise in dynamic capability. Our current implementation is capable of measuring thousands of antenna combinations within tens of milliseconds. Every radio frequency switch is seamlessly integrated with a 16-element antenna array, making the antennas more practical to be transported and flexibly distributed. In addition, the channel sounder features real-time processing to reduce the data stream to the host computer and increase the signal-to-noise ratio. The design and implementation are verified through two measurements in an indoor laboratory environment. The first measurement entails a single-antenna robot as transmitter and 128 distributed receiving antennas. The second measurement demonstrates a passive sensing scenario with a walking person. We evaluate the results of both measurements using the super-resolution algorithm SAGE. The results demonstrate the great potential of the presented sounding system for providing high-quality radio channel measurements, contributing to high-resolution channel estimation, characterization, and active and passive sensing in realistic and dynamic scenarios.

Via

Access Paper or Ask Questions

A Belief Propagation Algorithm for Multipath-based SLAM with Multiple Map Features: A mmWave MIMO Application

Mar 15, 2024

Xuhong Li, Xuesong Cai, Erik Leitinger, Fredrik Tufvesson

Abstract:In this paper, we present a multipath-based simultaneous localization and mapping (SLAM) algorithm that continuously adapts mulitiple map feature (MF) models describing specularly reflected multipath components (MPCs) from flat surfaces and point-scattered MPCs, respectively. We develop a Bayesian model for sequential detection and estimation of interacting MF model parameters, MF states and mobile agent's state including position and orientation. The Bayesian model is represented by a factor graph enabling the use of belief propagation (BP) for efficient computation of the marginal posterior distributions. The algorithm also exploits amplitude information enabling reliable detection of weak MFs associated with MPCs of very low signal-to-noise ratios (SNRs). The performance of the proposed algorithm is evaluated using real millimeter-wave (mmWave) multiple-input-multiple-output (MIMO) measurements with single base station setup. Results demonstrate the excellent localization and mapping performance of the proposed algorithm in challenging dynamic outdoor scenarios.

* 7 pages (two column), 4 figures, accepted to 2024 IEEE International Conference on Communications (ICC), WS05 Workshop

Via

Access Paper or Ask Questions