Jialian Li

Boosting Deductive Reasoning with Step Signals In RLHF

Oct 12, 2024

Jialian Li, Yipin Zhang, Wei Shen, Yuzi Yan, Jian Xie, Dong Yan

Abstract: Logical reasoning is a crucial task for Large Language Models (LLMs), enabling them to tackle complex problems. Among reasoning tasks, multi-step reasoning poses a particular challenge. Grounded in the theory of formal logic, we have developed an automated method, Multi-step Deduction (MuseD), for generating deductive reasoning data. MuseD has allowed us to create training and testing datasets for multi-step reasoning. Our generation method enables control over the complexity of the generated instructions, facilitating training and evaluation of models across different difficulty levels. Through RLHF training, our training data has demonstrated significant improvements in logical capabilities on both in-domain and out-of-domain reasoning tasks. Additionally, we have conducted tests to assess the multi-step reasoning abilities of various models.

3D-Properties: Identifying Challenges in DPO and Charting a Path Forward

Jun 11, 2024

Yuzi Yan, Yibo Miao, Jialian Li, Yipin Zhang, Jian Xie, Zhijie Deng, Dong Yan

Abstract: Aligning large language models (LLMs) with human preference has recently gained tremendous attention, with the canonical yet costly RLHF-PPO and the simple and straightforward Direct Preference Optimization (DPO) as two examples. Despite its efficiency, DPO has rarely been used in state-of-the-art production-level LLMs, implying its potential pathologies. In this work, we revisit DPO with a comprehensive examination of its empirical efficacy and a systematic comparison with RLHF-PPO. We identify the 3D-properties of DPO's learning outcomes: the Drastic drop in the likelihood of rejected responses, the Degradation into LLM unlearning, and the Dispersion effect on unseen responses, through experiments with both a carefully designed toy model and practical LLMs on tasks including mathematical problem-solving and instruction following. These findings inherently connect to some observations made by related works, and we additionally contribute a plausible theoretical explanation for them. Accordingly, we propose easy regularization methods to mitigate the issues caused by the 3D-properties, improving the training stability and final performance of DPO. Our contributions also include an investigation into how the distribution of the paired preference data impacts the effectiveness of DPO. We hope this work could offer research directions to narrow the gap between reward-free preference learning methods and reward-based ones.
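
As a hedged illustration of the objective this abstract examines, the sketch below computes the standard per-example DPO loss as it is commonly written, not the regularized variants proposed in the paper. The function and argument names are hypothetical, and `beta=0.1` is an arbitrary illustrative value. Because the loss depends only on the difference between the two log-ratios, it can fall when the likelihood of the rejected response drops, even if the likelihood of the chosen response drops as well.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-example DPO loss: -log sigmoid(beta * margin).

    All inputs are sequence log-probabilities (hypothetical names).
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) = log(1 + exp(-m)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy equal to the reference: margin is zero, loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Both likelihoods fall, but the rejected one falls much more: loss still decreases.
both_drop = dpo_loss(-8.0, -30.0, -5.0, -5.0)
```

In this toy usage, `both_drop` is smaller than `baseline` even though the chosen response became less likely, which is the kind of behavior the abstract's first property refers to.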

Exploring the LLM Journey from Cognition to Expression with Linear Representations

May 27, 2024

Yuzi Yan, Jialian Li, Yipin Zhang, Dong Yan

Abstract: This paper presents an in-depth examination of the evolution and interplay of cognitive and expressive capabilities in large language models (LLMs), with a specific focus on Baichuan-7B and Baichuan-33B, an advanced bilingual (Chinese and English) LLM series. We define and explore the model's cognitive and expressive capabilities through linear representations across three critical phases: Pretraining, Supervised Fine-Tuning (SFT), and Reinforcement Learning from Human Feedback (RLHF). Cognitive capability is defined as the quantity and quality of information conveyed by the neuron output vectors within the network, similar to the neural signal processing in human cognition. Expressive capability is defined as the model's capability to produce word-level output. Our findings unveil a sequential development pattern, where cognitive abilities are largely established during Pretraining, whereas expressive abilities predominantly advance during SFT and RLHF. Statistical analyses confirm a significant correlation between the two capabilities, suggesting that cognitive capacity may limit expressive potential. The paper also explores the theoretical underpinnings of these divergent developmental trajectories and their connection to the LLMs' architectural design. Moreover, we evaluate various optimization-independent strategies, such as few-shot learning and repeated sampling, which bridge the gap between cognitive and expressive capabilities. This research reveals the potential connection between the hidden space and the output space, contributing valuable insights into the interpretability and controllability of LLM training processes.

* Published in ICML 2024

Reward Informed Dreamer for Task Generalization in Reinforcement Learning

Mar 09, 2023

Chengyang Ying, Zhongkai Hao, Xinning Zhou, Hang Su, Songming Liu, Jialian Li, Dong Yan, Jun Zhu

Abstract: A long-standing goal of reinforcement learning is for algorithms to learn on training tasks and generalize well to unseen tasks, as humans do, where different tasks share similar dynamics but have different reward functions. A general challenge is that it is nontrivial to quantitatively measure the similarities between these different tasks, which is vital for analyzing the task distribution and further designing algorithms with stronger generalization. To address this, we present a novel metric named Task Distribution Relevance (TDR) via optimal Q functions to capture the relevance of the task distribution quantitatively. In the case of tasks with a high TDR, i.e., the tasks differ significantly, we demonstrate that the Markovian policies cannot distinguish them, yielding poor performance accordingly. Based on this observation, we propose a framework of Reward Informed Dreamer (RID) with reward-informed world models, which captures invariant latent features over tasks and encodes reward signals into policies for distinguishing different tasks. In RID, we calculate the corresponding variational lower bound of the log-likelihood on the data, which includes a novel term to distinguish different tasks via states, based on reward-informed world models. Finally, extensive experiments on the DeepMind Control Suite demonstrate that RID can significantly improve the performance of handling different tasks at the same time, especially for those with high TDR, and further generalize to unseen tasks effectively.

LiDARCap: Long-range Marker-less 3D Human Motion Capture with LiDAR Point Clouds

Mar 28, 2022

Jialian Li, Jingyi Zhang, Zhiyong Wang, Siqi Shen, Chenglu Wen, Yuexin Ma, Lan Xu, Jingyi Yu, Cheng Wang

Abstract: Existing motion capture datasets are largely short-range and do not yet meet the needs of long-range applications. We propose LiDARHuman26M, a new human motion capture dataset captured by LiDAR at a much longer range to overcome this limitation. Our dataset also includes the ground truth human motions acquired by the IMU system and the synchronous RGB images. We further present a strong baseline method, LiDARCap, for LiDAR point cloud human motion capture. Specifically, we first utilize PointNet++ to encode features of points and then employ the inverse kinematics solver and SMPL optimizer to regress the pose through aggregating the temporally encoded features hierarchically. Quantitative and qualitative experiments show that our method outperforms the techniques based only on RGB images. Ablation experiments demonstrate that our dataset is challenging and worthy of further research. Finally, the experiments on the KITTI Dataset and the Waymo Open Dataset show that our method can be generalized to different LiDAR sensor settings.

Policy Learning for Robust Markov Decision Process with a Mismatched Generative Model

Mar 15, 2022

Jialian Li, Tongzheng Ren, Dong Yan, Hang Su, Jun Zhu

Abstract: In high-stakes scenarios like medical treatment and auto-piloting, it's risky or even infeasible to collect online experimental data to train the agent. Simulation-based training can alleviate this issue, but may suffer from inherent mismatches between the simulator and the real environment. It is therefore imperative to utilize the simulator to learn a robust policy for real-world deployment. In this work, we consider policy learning for Robust Markov Decision Processes (RMDP), where the agent seeks a policy that is robust to unexpected perturbations of the environment. Specifically, we focus on the setting where the training environment can be characterized as a generative model and a constrained perturbation can be added to the model during testing. Our goal is to identify a near-optimal robust policy for the perturbed testing environment, which introduces additional technical difficulties as we need to simultaneously estimate the training environment uncertainty from samples and find the worst-case perturbation for testing. To solve this issue, we propose a generic method which formalizes the perturbation as an opponent to obtain a two-player zero-sum game, and further show that the Nash Equilibrium corresponds to the robust policy. We prove that, with a polynomial number of samples from the generative model, our algorithm can find a near-optimal robust policy with high probability. Our method is able to deal with general perturbations under some mild assumptions and can also be extended to more complex problems like robust partially observable Markov decision processes, thanks to the game-theoretical formulation.

* AAAI 2022

Nearly Horizon-Free Offline Reinforcement Learning

Mar 25, 2021

Tongzheng Ren, Jialian Li, Bo Dai, Simon S. Du, Sujay Sanghavi

Abstract: We revisit offline reinforcement learning on episodic time-homogeneous tabular Markov Decision Processes with $S$ states, $A$ actions and planning horizon $H$. Given $N$ collected episodes of data with minimum cumulative reaching probability $d_m$, we obtain the first set of nearly $H$-free sample complexity bounds for evaluation and planning using the empirical MDPs: (1) For offline evaluation, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{Nd_m}}\right)$ error rate, which matches the lower bound and, unlike previous works \citep{yin2020near,yin2020asymptotically}, has no additional dependency on $\mathrm{poly}\left(S,A\right)$ in the higher-order term. (2) For offline policy optimization, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{Nd_m}} + \frac{S}{Nd_m}\right)$ error rate, improving upon the best known result by \cite{cui2020plug}, which has additional $H$ and $S$ factors in the main term. Furthermore, this bound approaches the $\Omega\left(\sqrt{\frac{1}{Nd_m}}\right)$ lower bound up to logarithmic factors and a higher-order term. To the best of our knowledge, these are the first set of nearly horizon-free bounds in offline reinforcement learning.

Fast Regularity-Constrained Plane Reconstruction

May 20, 2019

Yangbin Lin, Jialian Li, Cheng Wang, Zhonggui Chen, Zongyue Wang, Jonathan Li

Abstract: Man-made environments typically comprise planar structures that exhibit numerous geometric relationships, such as parallelism, coplanarity, and orthogonality. Making full use of these relationships can considerably improve the robustness of algorithmic plane reconstruction of complex scenes. This research leverages a constraint model requiring minimal prior knowledge to implicitly establish relationships among planes. We introduce a method based on energy minimization to reconstruct the planes consistent with our constraint model. The proposed algorithm is efficient, easy to understand, and simple to implement. The experimental results show that our algorithm successfully reconstructs planes under high percentages of noise and outliers, and that it is superior to other state-of-the-art regularity-constrained plane reconstruction methods in terms of speed and robustness.

Lazy-CFR: a fast regret minimization algorithm for extensive games with imperfect information

Oct 10, 2018

Yichi Zhou, Tongzheng Ren, Jialian Li, Dong Yan, Jun Zhu

Abstract: In this paper, we focus on solving two-player zero-sum extensive games with imperfect information. Counterfactual regret minimization (CFR) is the most popular algorithm for solving such games and achieves state-of-the-art performance in practice. However, the performance of CFR is not fully understood, since empirical results on the regret are much better than the upper bound proved in \cite{zinkevich2008regret}. Another issue is that CFR has to traverse the whole game tree in each round, which is not tolerable in large-scale games. In this paper, we present a novel technique, lazy update, which can avoid traversing the whole game tree in CFR. Further, we present a novel analysis of CFR with lazy update. Our analysis can also be applied to the vanilla CFR, which results in a much tighter regret bound than that proved in \cite{zinkevich2008regret}. Inspired by lazy update, we further present a novel CFR variant, named Lazy-CFR. Compared to traversing $O(|\mathcal{I}|)$ information sets in vanilla CFR, Lazy-CFR needs to traverse only $O(\sqrt{|\mathcal{I}|})$ information sets per round while the regret bound remains almost the same, where $\mathcal{I}$ is the class of all information sets. As a result, Lazy-CFR shows better convergence than vanilla CFR. Experimental results consistently show that Lazy-CFR outperforms the vanilla CFR significantly.
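
For readers unfamiliar with CFR, the sketch below shows regret matching, the per-information-set rule that vanilla CFR commonly uses to turn cumulative regrets into a strategy. It is only a minimal illustration under that assumption: it is not the lazy update or Lazy-CFR described in the abstract, and the function name is hypothetical.

```python
def regret_matching(cumulative_regrets):
    """Map cumulative regrets at one information set to a mixed strategy.

    Actions are played in proportion to their positive cumulative regret;
    if no action has positive regret, the strategy is uniform.
    """
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    n = len(cumulative_regrets)
    return [1.0 / n] * n


# Two actions with equal positive regret share all the probability mass.
strategy = regret_matching([2.0, -1.0, 2.0])
# No positive regret: fall back to the uniform strategy.
uniform = regret_matching([-1.0, -2.0])
```

Vanilla CFR applies a rule of this kind at every information set in every round, which is the full-tree traversal cost that the abstract identifies as a bottleneck.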

The YouTube-8M Kaggle Competition: Challenges and Methods

Jul 13, 2017

Haosheng Zou, Kun Xu, Jialian Li, Jun Zhu

Abstract: We took part in the YouTube-8M Video Understanding Challenge hosted on Kaggle, and achieved 10th place within less than one month's time. In this paper, we present an extensive analysis and solution to the underlying machine-learning problem based on frame-level data, where major challenges are identified and corresponding preliminary methods are proposed. It's noteworthy that the proposed strategies, combined with a uniformly averaged multi-crop ensemble, were sufficient for us to reach our ranking. We also report the methods we believe to be promising but didn't have enough time to train to convergence. We hope this paper could serve, to some extent, as a review and guideline of the YouTube-8M multi-label video classification benchmark, inspiring future attempts and research.

* accepted to CVPR'17 Workshop on YouTube-8M Large-Scale Video Understanding (oral presentation); code is at https://github.com/taufikxu/youtube on branches kunxu and zhs
