Abstract:The importance of Reinforcement Learning from Human Feedback (RLHF) in aligning large language models (LLMs) with human values cannot be overstated. RLHF is a three-stage process that includes supervised fine-tuning (SFT), reward learning, and policy learning. Although there are several offline and online approaches to aligning LLMs, they often suffer from distribution shift issues. These issues arise from the inability to accurately capture the distributional interdependence between the reward learning and policy learning stages. Consequently, this has led to various approximated approaches, but the theoretical insights and motivations remain largely limited to tabular settings, which do not hold in practice. This gap between theoretical insights and practical implementations is critical. It is challenging to address this gap as it requires analyzing the performance of AI alignment algorithms in neural network-parameterized settings. Although bi-level formulations have shown promise in addressing distribution shift issues, they suffer from the hyper-gradient problem, and current approaches lack efficient algorithms to solve this. In this work, we tackle these challenges employing the bi-level formulation laid out in Kwon et al. (2024) along with the assumption \emph{Weak Gradient Domination} to demonstrate convergence in an RLHF setup, obtaining a sample complexity of $\epsilon^{-\frac{7}{2}}$ . Our key contributions are twofold: (i) We propose a bi-level formulation for AI alignment in parameterized settings and introduce a first-order approach to solve this problem. (ii) We analyze the theoretical convergence rates of the proposed algorithm and derive state-of-the-art bounds. To the best of our knowledge, this is the first work to establish convergence rate bounds and global optimality for the RLHF framework in neural network-parameterized settings.
Abstract:By framing reinforcement learning as a sequence modeling problem, recent work has enabled the use of generative models, such as diffusion models, for planning. While these models are effective in predicting long-horizon state trajectories in deterministic environments, they face challenges in dynamic settings with moving obstacles. Effective collision avoidance demands continuous monitoring and adaptive decision-making. While replanning at every timestep could ensure safety, it introduces substantial computational overhead due to the repetitive prediction of overlapping state sequences -- a process that is particularly costly with diffusion models, known for their intensive iterative sampling procedure. We propose an adaptive generative planning approach that dynamically adjusts replanning frequency based on the uncertainty of action predictions. Our method minimizes the need for frequent, computationally expensive, and redundant replanning while maintaining robust collision avoidance performance. In experiments, we obtain a 13.5% increase in the mean trajectory length and a 12.7% increase in mean reward over long-horizon planning, indicating a reduction in collision rates and an improved ability to navigate the environment safely.
Abstract:We consider the problem of learning a Constrained Markov Decision Process (CMDP) via general parameterization. Our proposed Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm uses entropy and quadratic regularizers to reach this goal. For a parameterized policy class with transferred compatibility approximation error, $\epsilon_{\mathrm{bias}}$, PDR-ANPG achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation (up to some additive factor of $\epsilon_{\mathrm{bias}}$) with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}\})$. If the class is incomplete ($\epsilon_{\mathrm{bias}}>0$), then the sample complexity reduces to $\tilde{\mathcal{O}}(\epsilon^{-2})$ for $\epsilon<(\epsilon_{\mathrm{bias}})^{\frac{1}{6}}$. Moreover, for complete policies with $\epsilon_{\mathrm{bias}}=0$, our algorithm achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with $\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity. It is a significant improvement of the state-of-the-art last-iterate guarantees of general parameterized CMDPs.
Abstract:Non-local operations play a crucial role in computer vision enabling the capture of long-range dependencies through weighted sums of features across the input, surpassing the constraints of traditional convolution operations that focus solely on local neighborhoods. Non-local operations typically require computing pairwise relationships between all elements in a set, leading to quadratic complexity in terms of time and memory. Due to the high computational and memory demands, scaling non-local neural networks to large-scale problems can be challenging. This article introduces a hybrid quantum-classical scalable non-local neural network, referred to as Quantum Non-Local Neural Network (QNL-Net), to enhance pattern recognition. The proposed QNL-Net relies on inherent quantum parallelism to allow the simultaneous processing of a large number of input features enabling more efficient computations in quantum-enhanced feature space and involving pairwise relationships through quantum entanglement. We benchmark our proposed QNL-Net with other quantum counterparts to binary classification with datasets MNIST and CIFAR-10. The simulation findings showcase our QNL-Net achieves cutting-edge accuracy levels in binary image classification among quantum classifiers while utilizing fewer qubits.
Abstract:In our study, we delve into average-reward reinforcement learning with general policy parametrization. Within this domain, current guarantees either fall short with suboptimal guarantees or demand prior knowledge of mixing time. To address these issues, we introduce Randomized Accelerated Natural Actor Critic, a method that integrates Multi-level Monte-Carlo and Natural Actor Critic. Our approach is the first to achieve global convergence rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ without requiring knowledge of mixing time, significantly surpassing the state-of-the-art bound of $\tilde{\mathcal{O}}(1/T^{1/4})$.
Abstract:Skills are effective temporal abstractions established for sequential decision making tasks, which enable efficient hierarchical learning for long-horizon tasks and facilitate multi-task learning through their transferability. Despite extensive research, research gaps remain in multi-agent scenarios, particularly for automatically extracting subgroup coordination patterns in a multi-agent task. In this case, we propose two novel auto-encoder schemes: VO-MASD-3D and VO-MASD-Hier, to simultaneously capture subgroup- and temporal-level abstractions and form multi-agent skills, which firstly solves the aforementioned challenge. An essential algorithm component of these schemes is a dynamic grouping function that can automatically detect latent subgroups based on agent interactions in a task. Notably, our method can be applied to offline multi-task data, and the discovered subgroup skills can be transferred across relevant tasks without retraining. Empirical evaluations on StarCraft tasks indicate that our approach significantly outperforms existing methods regarding applying skills in multi-agent reinforcement learning (MARL). Moreover, skills discovered using our method can effectively reduce the learning difficulty in MARL scenarios with delayed and sparse reward signals.
Abstract:We consider a constrained Markov Decision Problem (CMDP) where the goal of an agent is to maximize the expected discounted sum of rewards over an infinite horizon while ensuring that the expected discounted sum of costs exceeds a certain threshold. Building on the idea of momentum-based acceleration, we develop the Primal-Dual Accelerated Natural Policy Gradient (PD-ANPG) algorithm that guarantees an $\epsilon$ global optimality gap and $\epsilon$ constraint violation with $\mathcal{O}(\epsilon^{-3})$ sample complexity. This improves the state-of-the-art sample complexity in CMDP by a factor of $\mathcal{O}(\epsilon^{-1})$.
Abstract:In complex environments with large discrete action spaces, effective decision-making is critical in reinforcement learning (RL). Despite the widespread use of value-based RL approaches like Q-learning, they come with a computational burden, necessitating the maximization of a value function over all actions in each iteration. This burden becomes particularly challenging when addressing large-scale problems and using deep neural networks as function approximators. In this paper, we present stochastic value-based RL approaches which, in each iteration, as opposed to optimizing over the entire set of $n$ actions, only consider a variable stochastic set of a sublinear number of actions, possibly as small as $\mathcal{O}(\log(n))$. The presented stochastic value-based RL methods include, among others, Stochastic Q-learning, StochDQN, and StochDDQN, all of which integrate this stochastic approach for both value-function updates and action selection. The theoretical convergence of Stochastic Q-learning is established, while an analysis of stochastic maximization is provided. Moreover, through empirical validation, we illustrate that the various proposed approaches outperform the baseline methods across diverse environments, including different control problems, achieving near-optimal average returns in significantly reduced time.
Abstract:This paper introduces a federated learning framework tailored for online combinatorial optimization with bandit feedback. In this setting, agents select subsets of arms, observe noisy rewards for these subsets without accessing individual arm information, and can cooperate and share information at specific intervals. Our framework transforms any offline resilient single-agent $(\alpha-\epsilon)$-approximation algorithm, having a complexity of $\tilde{\mathcal{O}}(\frac{\psi}{\epsilon^\beta})$, where the logarithm is omitted, for some function $\psi$ and constant $\beta$, into an online multi-agent algorithm with $m$ communicating agents and an $\alpha$-regret of no more than $\tilde{\mathcal{O}}(m^{-\frac{1}{3+\beta}} \psi^\frac{1}{3+\beta} T^\frac{2+\beta}{3+\beta})$. This approach not only eliminates the $\epsilon$ approximation error but also ensures sublinear growth with respect to the time horizon $T$ and demonstrates a linear speedup with an increasing number of communicating agents. Additionally, the algorithm is notably communication-efficient, requiring only a sublinear number of communication rounds, quantified as $\tilde{\mathcal{O}}\left(\psi T^\frac{\beta}{\beta+1}\right)$. Furthermore, the framework has been successfully applied to online stochastic submodular maximization using various offline algorithms, yielding the first results for both single-agent and multi-agent settings and recovering specialized single-agent theoretical guarantees. We empirically validate our approach to a stochastic data summarization problem, illustrating the effectiveness of the proposed framework, even in single-agent scenarios.
Abstract:The current state-of-the-art theoretical analysis of Actor-Critic (AC) algorithms significantly lags in addressing the practical aspects of AC implementations. This crucial gap needs bridging to bring the analysis in line with practical implementations of AC. To address this, we advocate for considering the MMCLG criteria: \textbf{M}ulti-layer neural network parametrization for actor/critic, \textbf{M}arkovian sampling, \textbf{C}ontinuous state-action spaces, the performance of the \textbf{L}ast iterate, and \textbf{G}lobal optimality. These aspects are practically significant and have been largely overlooked in existing theoretical analyses of AC algorithms. In this work, we address these gaps by providing the first comprehensive theoretical analysis of AC algorithms that encompasses all five crucial practical aspects (covers MMCLG criteria). We establish global convergence sample complexity bounds of $\tilde{\mathcal{O}}\left({\epsilon^{-3}}\right)$. We achieve this result through our novel use of the weak gradient domination property of MDP's and our unique analysis of the error in critic estimation.