Shangzhe Li

Language Model Distillation: A Temporal Difference Imitation Learning Perspective

May 24, 2025

Zishun Yu, Shangzhe Li, Xinhua Zhang

Abstract:Large language models have led to significant progress across many NLP tasks, although their massive sizes often incur substantial computational costs. Distillation has become a common practice to compress these large and highly capable models into smaller, more efficient ones. Many existing language model distillation methods can be viewed as behavior cloning from the perspective of imitation learning or inverse reinforcement learning. This viewpoint has inspired subsequent studies that leverage (inverse) reinforcement learning techniques, including variations of behavior cloning and temporal difference learning methods. Rather than proposing yet another specific temporal difference method, we introduce a general framework for temporal difference-based distillation by exploiting the distributional sparsity of the teacher model. Specifically, it is often observed that language models assign most probability mass to a small subset of tokens. Motivated by this observation, we design a temporal difference learning framework that operates on a reduced action space (a subset of the vocabulary), and demonstrate how practical algorithms can be derived from it, along with the resulting performance improvements.
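
The reduced action space mentioned in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the function name `reduced_action_space`, the `mass` threshold, and the toy teacher distribution are all assumptions chosen for illustration. It only shows how a small subset of tokens can cover most of a teacher's probability mass.

```python
def reduced_action_space(probs, mass=0.9):
    """Return the smallest set of token ids whose teacher probability sums to
    at least `mass`, plus the teacher distribution renormalized on that set.

    Hypothetical helper: the selection rule is an assumption, not the paper's.
    """
    # Rank token ids by teacher probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        total += probs[token_id]
        if total >= mass:
            break
    # Renormalize so the kept tokens form a proper distribution.
    renormalized = {t: probs[t] / total for t in kept}
    return kept, renormalized


# Toy teacher distribution over a 6-token vocabulary (made-up numbers).
teacher_probs = [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.03125]
kept, renorm = reduced_action_space(teacher_probs, mass=0.9)
print(kept)    # token ids retained as the reduced action space
print(renorm)  # teacher probabilities renormalized over those ids
```

In this toy case four of the six tokens are enough to cover 90% of the mass; with a real vocabulary the retained fraction would typically be far smaller.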

Coupled Distributional Random Expert Distillation for World Model Online Imitation Learning

May 04, 2025

Shangzhe Li, Zhiao Huang, Hao Su

Abstract:Imitation Learning (IL) has achieved remarkable success across various domains, including robotics, autonomous driving, and healthcare, by enabling agents to learn complex behaviors from expert demonstrations. However, existing IL methods often face instability challenges, particularly when relying on adversarial reward or value formulations in world model frameworks. In this work, we propose a novel approach to online imitation learning that addresses these limitations through a reward model based on random network distillation (RND) for density estimation. Our reward model is built on the joint estimation of expert and behavioral distributions within the latent space of the world model. We evaluate our method across diverse benchmarks, including DMControl, Meta-World, and ManiSkill2, showcasing its ability to deliver stable performance and achieve expert-level results in both locomotion and manipulation tasks. Our approach demonstrates improved stability over adversarial methods while maintaining expert-level performance.

Molecular Graph Contrastive Learning with Line Graph

Jan 15, 2025

Xueyuan Chen, Shangzhe Li, Ruomei Liu, Bowen Shi, Jiaheng Liu, Junran Wu, Ke Xu

Abstract:Graph contrastive learning (GCL) emerged in response to the label scarcity in molecular property prediction and drug design. Leading contrastive learning methods use two kinds of view generators: random or learnable data corruption, and domain knowledge incorporation. While effective, these two approaches respectively alter molecular semantics and limit generalization capability. To this end, we relate the LinE graph with MOlecular graph coNtrastive learning and propose a novel method termed LEMON. Specifically, by contrasting the given graph with the corresponding line graph, the graph encoder can freely encode the molecular semantics without omission. Furthermore, we present a new patch with edge attribute fusion and two local contrastive losses to enhance information transmission and tackle hard negative samples. Compared with state-of-the-art (SOTA) methods for view generation, superior performance on molecular property prediction suggests the effectiveness of our proposed framework.
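
For readers unfamiliar with the line graph that the abstract contrasts against, a minimal sketch of the standard construction follows. It covers only the graph transformation, not the contrastive objective or the encoder. The function name `line_graph` and the toy bond list are assumptions for illustration.

```python
from itertools import combinations


def line_graph(edges):
    """Build the line graph of an undirected simple graph.

    Each original edge becomes a node; two such nodes are adjacent when the
    original edges share an endpoint.
    """
    # Normalize each edge so (u, v) and (v, u) map to the same node.
    nodes = [tuple(sorted(e)) for e in edges]
    adjacency = {n: set() for n in nodes}
    for a, b in combinations(nodes, 2):
        if set(a) & set(b):  # the two edges share at least one endpoint
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


# Hypothetical toy "molecule": a chain of four atoms 0-1-2-3 with three bonds.
bonds = [(0, 1), (1, 2), (2, 3)]
lg = line_graph(bonds)
for bond, neighbours in lg.items():
    print(bond, sorted(neighbours))
```

In the resulting graph every bond is a node, so bond attributes become node attributes, which is one reason a line graph can serve as an alternative view of the same molecule.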

Reward-free World Models for Online Imitation Learning

Oct 17, 2024

Shangzhe Li, Zhiao Huang, Hao Su

Abstract:Imitation learning (IL) enables agents to acquire skills directly from expert demonstrations, providing a compelling alternative to reinforcement learning. However, prior online IL approaches struggle with complex tasks characterized by high-dimensional inputs and complex dynamics. In this work, we propose a novel approach to online imitation learning that leverages reward-free world models. Our method learns environmental dynamics entirely in latent spaces without reconstruction, enabling efficient and accurate modeling. We adopt the inverse soft-Q learning objective, reformulating the optimization process in the Q-policy space to mitigate the instability associated with traditional optimization in the reward-policy space. By employing a learned latent dynamics model and planning for control, our approach consistently achieves stable, expert-level performance in tasks with high-dimensional observation or action spaces and intricate dynamics. We evaluate our method on a diverse set of benchmarks, including DMControl, MyoSuite, and ManiSkill2, demonstrating superior empirical performance compared to existing approaches.

HILL: Hierarchy-aware Information Lossless Contrastive Learning for Hierarchical Text Classification

Mar 26, 2024

He Zhu, Junran Wu, Ruomei Liu, Yue Hou, Ze Yuan, Shangzhe Li, Yicheng Pan, Ke Xu

Abstract:Existing self-supervised methods in natural language processing (NLP), especially for hierarchical text classification (HTC), mainly focus on self-supervised contrastive learning and rely heavily on human-designed augmentation rules to generate contrastive samples, which can corrupt or distort the original information. In this paper, we investigate the feasibility of a contrastive learning scheme in which the semantic and syntactic information inherent in the input sample is adequately preserved in the contrastive samples and fused during the learning process. Specifically, we propose an information lossless contrastive learning strategy for HTC, namely Hierarchy-aware Information Lossless contrastive Learning (HILL), which consists of a text encoder representing the input document and a structure encoder directly generating the positive sample. The structure encoder takes the document embedding as input, extracts the essential syntactic information inherent in the label hierarchy with the principle of structural entropy minimization, and injects the syntactic information into the text representation via hierarchical representation learning. Experiments on three common datasets are conducted to verify the superiority of HILL.

* Accepted by NAACL 2024

Distilling Conditional Diffusion Models for Offline Reinforcement Learning through Trajectory Stitching

Feb 01, 2024

Shangzhe Li, Xinhua Zhang

Abstract:Deep generative models have recently emerged as an effective approach to offline reinforcement learning. However, their large model size poses challenges in computation. We address this issue by proposing a knowledge distillation method based on data augmentation. In particular, high-return trajectories are generated from a conditional diffusion model, and they are blended with the original trajectories through a novel stitching algorithm that leverages a new reward generator. When the resulting dataset is used for behavioral cloning, the learned shallow policy, whose size is much smaller, outperforms or nearly matches deep generative planners on several D4RL benchmarks.

SEGA: Structural Entropy Guided Anchor View for Graph Contrastive Learning

May 08, 2023

Junran Wu, Xueyuan Chen, Bowen Shi, Shangzhe Li, Ke Xu

Abstract:In contrastive learning, the choice of "view" controls the information that the representation captures and influences the performance of the model. However, leading graph contrastive learning methods generally produce views via random corruption or learning, which could lead to the loss of essential information and alteration of semantic information. An anchor view that maintains the essential information of input graphs for contrastive learning has hardly been investigated. In this paper, based on the theory of graph information bottleneck, we deduce the definition of this anchor view; put differently, the anchor view with essential information of the input graph is supposed to have the minimal structural uncertainty. Furthermore, guided by structural entropy, we implement the anchor view, termed SEGA, for graph contrastive learning. We extensively validate the proposed anchor view on various benchmarks regarding graph classification under unsupervised, semi-supervised, and transfer learning and achieve significant performance boosts compared to the state-of-the-art methods.

* ICML'23
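
The abstract above relies on structural entropy as a measure of structural uncertainty but does not define it. As a rough illustration only, the sketch below computes the one-dimensional structural entropy of an undirected graph, H = -sum_i (d_i / 2m) log2(d_i / 2m), where d_i is the degree of node i and m the number of edges. This is the simplest member of the family and is not the SEGA procedure itself. The function name and the toy graphs are assumptions.

```python
import math
from collections import Counter


def one_dim_structural_entropy(edges):
    """One-dimensional structural entropy of an undirected graph.

    Equals the Shannon entropy (in bits) of the degree distribution d_i / 2m.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    volume = sum(degree.values())  # 2m, the total degree
    return -sum((d / volume) * math.log2(d / volume) for d in degree.values())


# Toy graphs (made up): a 4-cycle and a star with three leaves.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
star = [(0, 1), (0, 2), (0, 3)]
print(one_dim_structural_entropy(cycle))
print(one_dim_structural_entropy(star))
```

The regular cycle has a uniform degree distribution and therefore the higher value of the two; the star concentrates degree on its hub and scores lower.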

Structural Entropy Guided Graph Hierarchical Pooling

Jun 26, 2022

Junran Wu, Xueyuan Chen, Ke Xu, Shangzhe Li

Abstract:Following the success of convolution on non-Euclidean space, the corresponding pooling approaches have also been validated on various tasks regarding graphs. However, because of the fixed compression quota and stepwise pooling design, these hierarchical pooling methods still suffer from local structure damage and suboptimality. In this work, inspired by structural entropy, we propose a hierarchical pooling approach, SEP, to tackle the two issues. Specifically, without assigning the layer-specific compression quota, a global optimization algorithm is designed to generate the cluster assignment matrices for pooling at once. Then, we present an illustration of the local structure damage from previous methods in the reconstruction of ring and grid synthetic graphs. In addition to SEP, we further design two classification models, SEP-G and SEP-N, for graph classification and node classification, respectively. The results show that SEP outperforms state-of-the-art graph pooling methods on graph classification benchmarks and obtains superior performance on node classification.

* Accepted by ICML 2022

A Simple yet Effective Method for Graph Classification

Jun 06, 2022

Junran Wu, Shangzhe Li, Jianhao Li, Yicheng Pan, Ke Xu

Abstract:In deep neural networks, better results can often be obtained by increasing the complexity of previously developed basic models. However, it is unclear whether there is a way to boost performance by decreasing the complexity of such models. Intuitively, given a problem, a simpler data structure comes with a simpler algorithm. Here, we investigate the feasibility of improving graph classification performance while simplifying the learning process. Inspired by structural entropy on graphs, we transform the data sample from graphs to coding trees, which are a simpler but essential structure for graph data. Furthermore, we propose a novel message passing scheme, termed hierarchical reporting, in which features are transferred from leaf nodes to root nodes by following the hierarchical structure of coding trees. We then present a tree kernel and a convolutional network to implement our scheme for graph classification. With the designed message passing scheme, the tree kernel and convolutional network have a runtime complexity of $O(n)$, lower than that of the Weisfeiler-Lehman subtree kernel and other graph neural networks, which is at least $O(hm)$. We empirically validate our methods with several graph classification benchmarks and demonstrate that they achieve better performance and lower computational consumption than competing approaches.

* Accepted by IJCAI 2022. arXiv admin note: substantial text overlap with arXiv:2109.02027
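
The leaf-to-root feature flow that the abstract calls hierarchical reporting can be sketched as a single bottom-up pass over a tree. This is a simplified illustration, not the paper's tree kernel or network. Sum aggregation, the function name `hierarchical_reporting`, and the toy tree are all assumptions.

```python
def hierarchical_reporting(children, leaf_features, root):
    """Propagate feature vectors from the leaves to the root of a tree.

    Each internal node receives the element-wise sum of its children's
    features (sum aggregation is an assumption made for this sketch).
    """
    features = dict(leaf_features)
    # Iterative post-order traversal: a node is pushed at most twice, so the
    # pass is linear in the number of tree nodes.
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if node in features:
            continue  # leaf, or already computed
        if not expanded:
            stack.append((node, True))
            stack.extend((child, False) for child in children[node])
        else:
            child_vecs = [features[child] for child in children[node]]
            features[node] = [sum(vals) for vals in zip(*child_vecs)]
    return features


# Hypothetical coding tree: the root has two clusters; leaves are graph nodes.
children = {"root": ["c1", "c2"], "c1": ["a", "b"], "c2": ["c"]}
leaf_features = {"a": [1, 0], "b": [0, 1], "c": [2, 2]}
feats = hierarchical_reporting(children, leaf_features, "root")
print(feats["c1"], feats["c2"], feats["root"])
```

Because every tree node is processed once, the cost grows with the size of the tree rather than with the number of edges in the original graph.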

Price graphs: Utilizing the structural information of financial time series for stock prediction

Jun 11, 2021

Junran Wu, Ke Xu, Xueyuan Chen, Shangzhe Li, Jichang Zhao

Abstract:Stock prediction, with the purpose of forecasting the future price trends of stocks, is crucial for maximizing profits from stock investments. While great research efforts have been devoted to exploiting deep neural networks for improved stock prediction, two major issues still exist in recent studies. First, the capture of long-range dependencies in time series is not sufficiently addressed. Second, the chaotic property of financial time series fundamentally lowers prediction performance. In this study, we propose a novel framework to address both issues regarding stock prediction. Specifically, drawing on methods that transform time series into complex networks, we convert market price series into graphs. Then, structural information, referring to associations among temporal points and the node weights, is extracted from the mapped graphs to resolve the problems regarding long-range dependencies and the chaotic property. We take graph embeddings to represent the associations among temporal points as the prediction model inputs. Node weights are used as a priori knowledge to enhance the learning of temporal attention. The effectiveness of our proposed framework is validated using real-world stock data, and our approach obtains the best performance among several state-of-the-art benchmarks. Moreover, in the conducted trading simulations, our framework further obtains the highest cumulative profits. Our results supplement the existing applications of complex network methods in the financial realm and provide insightful implications for investment applications regarding decision support in financial markets.
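
The abstract does not say which mapping turns a price series into a graph. One standard choice in the complex network literature is the natural visibility graph, used here purely as an assumed example: two time points are linked when the straight line between them passes above every point in between. The function name and the toy prices are hypothetical.

```python
def visibility_graph(series):
    """Natural visibility graph of a time series.

    Points i < j are linked when every intermediate point k lies strictly
    below the straight line joining (i, y_i) and (j, y_j).
    """
    n = len(series)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            visible = all(
                series[k] < series[i] + (series[j] - series[i]) * (k - i) / (j - i)
                for k in range(i + 1, j)
            )
            if visible:
                edges.add((i, j))
    return edges


# Hypothetical toy price series with five time points.
prices = [4.0, 1.0, 3.0, 2.0, 5.0]
graph = visibility_graph(prices)
print(sorted(graph))
```

Adjacent points are always linked, while the extra links (here between local peaks) connect distant time points, which is how such a graph can expose long-range associations in the series.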
