Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yinmin Zhang

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Mar 31, 2025

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum

Abstract:We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training focusing on scalability, simplicity and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE ($\lambda=1$, $\gamma=1$) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both response length and benchmark performance, similar to the phenomenon observed in DeepSeek-R1-Zero. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance on AIME2024, MATH500, and the GPQA Diamond benchmark while demonstrating remarkable efficiency -- requiring only a tenth of the training steps, compared to DeepSeek-R1-Zero pipeline. In the spirit of open source, we release our source code, parameter settings, training data, and model weights across various sizes.

Via

Access Paper or Ask Questions

Multi-matrix Factorization Attention

Dec 26, 2024

Jingcheng Hu, Houyi Li, Yinmin Zhang, Zili Wang, Shuigeng Zhou, Xiangyu Zhang, Heung-Yeung Shum

Figure 1 for Multi-matrix Factorization Attention

Figure 2 for Multi-matrix Factorization Attention

Figure 3 for Multi-matrix Factorization Attention

Figure 4 for Multi-matrix Factorization Attention

Abstract:We propose novel attention architectures, Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants for standard Multi-Head Attention (MHA), including SOTA methods like MLA, fail to maintain as strong performance under stringent Key-Value cache (KV cache) constraints. MFA enhances model capacity by efficiently scaling up both the number and dimension of attention heads through low-rank matrix factorization in the Query-Key (QK) circuit. Extending MFA, MFA-KR further reduces memory requirements by repurposing the key cache as value through value projection re-parameterization. MFA's design enables strong model capacity when working under tight KV cache budget, while MFA-KR is suitable for even harsher KV cache limits with minor performance trade-off. Notably, in our extensive and large-scale experiments, the proposed architecture outperforms MLA and performs comparably to MHA, while reducing KV cache usage by up to 56% and 93.7%, respectively.

Via

Access Paper or Ask Questions

Explore 3D Dance Generation via Reward Model from Automatically-Ranked Demonstrations

Dec 18, 2023

Zilin Wang, Haolin Zhuang, Lu Li, Yinmin Zhang, Junjie Zhong, Jun Chen, Yu Yang, Boshi Tang, Zhiyong Wu

Abstract:This paper presents an Exploratory 3D Dance generation framework, E3D2, designed to address the exploration capability deficiency in existing music-conditioned 3D dance generation models. Current models often generate monotonous and simplistic dance sequences that misalign with human preferences because they lack exploration capabilities. The E3D2 framework involves a reward model trained from automatically-ranked dance demonstrations, which then guides the reinforcement learning process. This approach encourages the agent to explore and generate high quality and diverse dance movement sequences. The soundness of the reward model is both theoretically and experimentally validated. Empirical experiments demonstrate the effectiveness of E3D2 on the AIST++ dataset. Project Page: https://sites.google.com/view/e3d2.

* AAAI-24

Via

Access Paper or Ask Questions

A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

Dec 12, 2023

Yinmin Zhang, Jie Liu, Chuming Li, Yazhe Niu, Yaodong Yang, Yu Liu, Wanli Ouyang

Figure 1 for A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

Figure 2 for A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

Figure 3 for A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

Figure 4 for A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

Abstract:Offline-to-online Reinforcement Learning (O2O RL) aims to improve the performance of offline pretrained policy using only a few online samples. Built on offline RL algorithms, most O2O methods focus on the balance between RL objective and pessimism, or the utilization of offline and online samples. In this paper, from a novel perspective, we systematically study the challenges that remain in O2O RL and identify that the reason behind the slow improvement of the performance and the instability of online finetuning lies in the inaccurate Q-value estimation inherited from offline pretraining. Specifically, we demonstrate that the estimation bias and the inaccurate rank of Q-value cause a misleading signal for the policy update, making the standard offline RL algorithms, such as CQL and TD3-BC, ineffective in the online finetuning. Based on this observation, we address the problem of Q-value estimation by two techniques: (1) perturbed value update and (2) increased frequency of Q-value updates. The first technique smooths out biased Q-value estimation with sharp peaks, preventing early-stage policy exploitation of sub-optimal actions. The second one alleviates the estimation bias inherited from offline pretraining by accelerating learning. Extensive experiments on the MuJoco and Adroit environments demonstrate that the proposed method, named SO2, significantly alleviates Q-value estimation issues, and consistently improves the performance against the state-of-the-art methods by up to 83.1%.

* Accepted at AAAI 2024

Via

Access Paper or Ask Questions

Masked Pretraining for Multi-Agent Decision Making

Oct 18, 2023

Jie Liu, Yinmin Zhang, Chuming Li, Chao Yang, Yaodong Yang, Yu Liu, Wanli Ouyang

Abstract:Building a single generalist agent with zero-shot capability has recently sparked significant advancements in decision-making. However, extending this capability to multi-agent scenarios presents challenges. Most current works struggle with zero-shot capabilities, due to two challenges particular to the multi-agent settings: a mismatch between centralized pretraining and decentralized execution, and varying agent numbers and action spaces, making it difficult to create generalizable representations across diverse downstream tasks. To overcome these challenges, we propose a \textbf{Mask}ed pretraining framework for \textbf{M}ulti-\textbf{a}gent decision making (MaskMA). This model, based on transformer architecture, employs a mask-based collaborative learning strategy suited for decentralized execution with partial observation. Moreover, MaskMA integrates a generalizable action representation by dividing the action space into actions toward self-information and actions related to other entities. This flexibility allows MaskMA to tackle tasks with varying agent numbers and thus different action spaces. Extensive experiments in SMAC reveal MaskMA, with a single model pretrained on 11 training maps, can achieve an impressive 77.8% zero-shot win rate on 60 unseen test maps by decentralized execution, while also performing effectively on other types of downstream tasks (\textit{e.g.,} varied policies collaboration and ad hoc team play).

* 17 pages

Via

Access Paper or Ask Questions

Towards Fair and Comprehensive Comparisons for Image-Based 3D Object Detection

Oct 11, 2023

Xinzhu Ma, Yongtao Wang, Yinmin Zhang, Zhiyi Xia, Yuan Meng, Zhihui Wang, Haojie Li, Wanli Ouyang

Figure 1 for Towards Fair and Comprehensive Comparisons for Image-Based 3D Object Detection

Figure 2 for Towards Fair and Comprehensive Comparisons for Image-Based 3D Object Detection

Figure 3 for Towards Fair and Comprehensive Comparisons for Image-Based 3D Object Detection

Figure 4 for Towards Fair and Comprehensive Comparisons for Image-Based 3D Object Detection

Abstract:In this work, we build a modular-designed codebase, formulate strong training recipes, design an error diagnosis toolbox, and discuss current methods for image-based 3D object detection. In particular, different from other highly mature tasks, e.g., 2D object detection, the community of image-based 3D object detection is still evolving, where methods often adopt different training recipes and tricks resulting in unfair evaluations and comparisons. What is worse, these tricks may overwhelm their proposed designs in performance, even leading to wrong conclusions. To address this issue, we build a module-designed codebase and formulate unified training standards for the community. Furthermore, we also design an error diagnosis toolbox to measure the detailed characterization of detection models. Using these tools, we analyze current methods in-depth under varying settings and provide discussions for some open questions, e.g., discrepancies in conclusions on KITTI-3D and nuScenes datasets, which have led to different dominant methods for these datasets. We hope that this work will facilitate future research in image-based 3D object detection. Our codes will be released at \url{https://github.com/OpenGVLab/3dodi}

* ICCV23, code will be released soon

Via

Access Paper or Ask Questions

Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning

Jul 24, 2023

Chuming Li, Ruonan Jia, Jie Liu, Yinmin Zhang, Yazhe Niu, Yaodong Yang, Yu Liu, Wanli Ouyang

Abstract:Model-based reinforcement learning (RL) has demonstrated remarkable successes on a range of continuous control tasks due to its high sample efficiency. To save the computation cost of conducting planning online, recent practices tend to distill optimized action sequences into an RL policy during the training phase. Although the distillation can incorporate both the foresight of planning and the exploration ability of RL policies, the theoretical understanding of these methods is yet unclear. In this paper, we extend the policy improvement step of Soft Actor-Critic (SAC) by developing an approach to distill from model-based planning to the policy. We then demonstrate that such an approach of policy improvement has a theoretical guarantee of monotonic improvement and convergence to the maximum value defined in SAC. We discuss effective design choices and implement our theory as a practical algorithm -- Model-based Planning Distilled to Policy (MPDP) -- that updates the policy jointly over multiple future time steps. Extensive experiments show that MPDP achieves better sample efficiency and asymptotic performance than both model-free and model-based planning algorithms on six continuous control benchmark tasks in MuJoCo.

Via

Access Paper or Ask Questions

ACE: Cooperative Multi-agent Q-learning with Bidirectional Action-Dependency

Dec 02, 2022

Chuming Li, Jie Liu, Yinmin Zhang, Yuhong Wei, Yazhe Niu, Yaodong Yang, Yu Liu, Wanli Ouyang

Figure 1 for ACE: Cooperative Multi-agent Q-learning with Bidirectional Action-Dependency

Figure 2 for ACE: Cooperative Multi-agent Q-learning with Bidirectional Action-Dependency

Figure 3 for ACE: Cooperative Multi-agent Q-learning with Bidirectional Action-Dependency

Figure 4 for ACE: Cooperative Multi-agent Q-learning with Bidirectional Action-Dependency

Abstract:Multi-agent reinforcement learning (MARL) suffers from the non-stationarity problem, which is the ever-changing targets at every iteration when multiple agents update their policies at the same time. Starting from first principle, in this paper, we manage to solve the non-stationarity problem by proposing bidirectional action-dependent Q-learning (ACE). Central to the development of ACE is the sequential decision-making process wherein only one agent is allowed to take action at one time. Within this process, each agent maximizes its value function given the actions taken by the preceding agents at the inference stage. In the learning phase, each agent minimizes the TD error that is dependent on how the subsequent agents have reacted to their chosen action. Given the design of bidirectional dependency, ACE effectively turns a multiagent MDP into a single-agent MDP. We implement the ACE framework by identifying the proper network representation to formulate the action dependency, so that the sequential decision process is computed implicitly in one forward pass. To validate ACE, we compare it with strong baselines on two MARL benchmarks. Empirical experiments demonstrate that ACE outperforms the state-of-the-art algorithms on Google Research Football and StarCraft Multi-Agent Challenge by a large margin. In particular, on SMAC tasks, ACE achieves 100% success rate on almost all the hard and super-hard maps. We further study extensive research problems regarding ACE, including extension, generalization, and practicability. Code is made available to facilitate further research.

* Accepted by the Thirty-Seventh AAAI Conference on Artificial Intelligence(AAAI2023)

Via

Access Paper or Ask Questions

An Empirical Study of Pseudo-Labeling for Image-based 3D Object Detection

Aug 15, 2022

Xinzhu Ma, Yuan Meng, Yinmin Zhang, Lei Bai, Jun Hou, Shuai Yi, Wanli Ouyang

Figure 1 for An Empirical Study of Pseudo-Labeling for Image-based 3D Object Detection

Figure 2 for An Empirical Study of Pseudo-Labeling for Image-based 3D Object Detection

Figure 3 for An Empirical Study of Pseudo-Labeling for Image-based 3D Object Detection

Figure 4 for An Empirical Study of Pseudo-Labeling for Image-based 3D Object Detection

Abstract:Image-based 3D detection is an indispensable component of the perception system for autonomous driving. However, it still suffers from the unsatisfying performance, one of the main reasons for which is the limited training data. Unfortunately, annotating the objects in the 3D space is extremely time/resource-consuming, which makes it hard to extend the training set arbitrarily. In this work, we focus on the semi-supervised manner and explore the feasibility of a cheaper alternative, i.e. pseudo-labeling, to leverage the unlabeled data. For this purpose, we conduct extensive experiments to investigate whether the pseudo-labels can provide effective supervision for the baseline models under varying settings. The experimental results not only demonstrate the effectiveness of the pseudo-labeling mechanism for image-based 3D detection (e.g. under monocular setting, we achieve 20.23 AP for moderate level on the KITTI-3D testing set without bells and whistles, improving the baseline model by 6.03 AP), but also show several interesting and noteworthy findings (e.g. the models trained with pseudo-labels perform better than that trained with ground-truth annotations based on the same training data). We hope this work can provide insights for the image-based 3D detection community under a semi-supervised setting. The codes, pseudo-labels, and pre-trained models will be publicly available.

* tech report

Via

Access Paper or Ask Questions

Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection

Jul 29, 2021

Yinmin Zhang, Xinzhu Ma, Shuai Yi, Jun Hou, Zhihui Wang, Wanli Ouyang, Dan Xu

Figure 1 for Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection

Figure 2 for Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection

Figure 3 for Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection

Figure 4 for Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection

Abstract:As a crucial task of autonomous driving, 3D object detection has made great progress in recent years. However, monocular 3D object detection remains a challenging problem due to the unsatisfactory performance in depth estimation. Most existing monocular methods typically directly regress the scene depth while ignoring important relationships between the depth and various geometric elements (e.g. bounding box sizes, 3D object dimensions, and object poses). In this paper, we propose to learn geometry-guided depth estimation with projective modeling to advance monocular 3D object detection. Specifically, a principled geometry formula with projective modeling of 2D and 3D depth predictions in the monocular 3D object detection network is devised. We further implement and embed the proposed formula to enable geometry-aware deep representation learning, allowing effective 2D and 3D interactions for boosting the depth estimation. Moreover, we provide a strong baseline through addressing substantial misalignment between 2D annotation and projected boxes to ensure robust learning with the proposed geometric formula. Experiments on the KITTI dataset show that our method remarkably improves the detection performance of the state-of-the-art monocular-based method without extra data by 2.80% on the moderate test setting. The model and code will be released at https://github.com/YinminZhang/MonoGeo.

* 16 pages, 11 figures

Via

Access Paper or Ask Questions