Abstract: The importance of Reinforcement Learning from Human Feedback (RLHF) in aligning large language models (LLMs) with human values cannot be overstated. RLHF is a three-stage process comprising supervised fine-tuning (SFT), reward learning, and policy learning. Although several offline and online approaches to aligning LLMs exist, they often suffer from distribution shift, arising from their inability to accurately capture the distributional interdependence between the reward learning and policy learning stages. This has led to various approximate approaches, but the theoretical insights and motivations behind them remain largely limited to tabular settings, which do not hold in practice. Bridging this gap between theoretical insights and practical implementations is critical yet challenging, as it requires analyzing the performance of AI alignment algorithms in neural network-parameterized settings. Although bi-level formulations have shown promise in addressing distribution shift, they suffer from the hyper-gradient problem, and current approaches lack efficient algorithms to solve it. In this work, we tackle these challenges by employing the bi-level formulation laid out in Kwon et al. (2024) along with the \emph{weak gradient domination} assumption to demonstrate convergence in an RLHF setup, obtaining a sample complexity of $\mathcal{O}\left(\epsilon^{-\frac{7}{2}}\right)$. Our key contributions are twofold: (i) we propose a bi-level formulation for AI alignment in parameterized settings and introduce a first-order approach to solve this problem; (ii) we analyze the theoretical convergence rate of the proposed algorithm and derive state-of-the-art bounds. To the best of our knowledge, this is the first work to establish convergence rate bounds and global optimality for the RLHF framework in neural network-parameterized settings.
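To make the first-order bi-level idea concrete, the toy sketch below alternates gradient-ascent steps on a KL-regularized policy objective (the lower level) with a penalty-based first-order update of a reward model (the upper level), in the spirit of the penalty reformulation of Kwon et al. (2024); no second-order hyper-gradient is ever formed. The single-state preference problem, the function names (`lower_objective`, `upper_loss`), and all constants are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a penalty-based first-order bi-level RLHF update on a
# toy single-state, K-armed problem.  The linear reward model r_phi and the
# softmax policy pi_theta stand in for the neural parametrizations; everything
# below is an illustrative assumption, not the paper's algorithm or code.
import numpy as np

rng = np.random.default_rng(0)
K, d = 5, 8                        # actions, feature dimension
features = rng.normal(size=(K, d))
beta, lam, lr = 0.1, 10.0, 0.05    # KL weight, penalty weight, step size

phi = np.zeros(d)                  # reward-model parameters (upper level)
theta = np.zeros(K)                # policy logits (lower level)
ref_logits = np.zeros(K)           # reference (SFT) policy logits

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def lower_objective(theta, phi):
    """KL-regularized policy objective that the lower level maximizes."""
    pi, ref = softmax(theta), softmax(ref_logits)
    return pi @ (features @ phi) - beta * np.sum(pi * np.log(pi / ref))

# Synthetic preference data: (winner, loser) action pairs.
prefs = [(w, l) for w, l in zip(rng.integers(0, K, 64), rng.integers(0, K, 64)) if w != l]

def upper_loss(phi):
    """Bradley-Terry negative log-likelihood of the reward model."""
    r = features @ phi
    return -np.mean([np.log(1.0 / (1.0 + np.exp(-(r[w] - r[l])))) for w, l in prefs])

def num_grad(f, x, eps=1e-5):
    """Finite-difference gradient, standing in for autodiff in this toy."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

for t in range(100):
    # Lower level: approximate best response theta_star(phi) ...
    theta_star = theta.copy()
    for _ in range(20):
        theta_star += lr * num_grad(lambda th: lower_objective(th, phi), theta_star)
    # ... and a small number of ascent steps for the actual policy iterate.
    for _ in range(2):
        theta += lr * num_grad(lambda th: lower_objective(th, phi), theta)
    # Upper level: first-order penalty gradient w.r.t. phi for
    #   upper_loss(phi) + lam * (lower(theta_star, phi) - lower(theta, phi)).
    # By the envelope theorem the theta_star term needs no hyper-gradient.
    gap_grad = (num_grad(lambda p: lower_objective(theta_star, p), phi)
                - num_grad(lambda p: lower_objective(theta, p), phi))
    phi -= lr * (num_grad(upper_loss, phi) + lam * gap_grad)

print("final preference loss:", round(float(upper_loss(phi)), 3))
```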
Abstract: The current state-of-the-art theoretical analysis of Actor-Critic (AC) algorithms significantly lags behind the practical aspects of AC implementations. This crucial gap needs to be bridged to bring the analysis in line with practical implementations of AC. To address this, we advocate for considering the MMCLG criteria: \textbf{M}ulti-layer neural network parametrization for the actor/critic, \textbf{M}arkovian sampling, \textbf{C}ontinuous state-action spaces, performance of the \textbf{L}ast iterate, and \textbf{G}lobal optimality. These aspects are practically significant, yet they have been largely overlooked in existing theoretical analyses of AC algorithms. In this work, we address these gaps by providing the first comprehensive theoretical analysis of AC algorithms that encompasses all five of these crucial practical aspects (i.e., covers the MMCLG criteria). We establish a global convergence sample complexity bound of $\tilde{\mathcal{O}}\left({\epsilon^{-3}}\right)$. We achieve this result through our novel use of the weak gradient domination property of MDPs and our unique analysis of the error in critic estimation.
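As a rough, hedged illustration of the setting the MMCLG criteria describe (not the analyzed algorithm itself), the sketch below runs a plain actor-critic loop with multi-layer (MLP) actor and critic networks, a single Markovian trajectory with no i.i.d. resets, a continuous one-dimensional state-action space, and evaluation of the last iterate. The toy dynamics, network widths, and step sizes are assumptions made for illustration.

```python
# Minimal actor-critic sketch touching each MMCLG aspect: multi-layer actor and
# critic, Markovian sampling from one trajectory, continuous state/action space,
# and reporting the last iterate.  All dynamics and constants are assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

def mlp(sizes):
    """Small multi-layer perceptron with Tanh hidden activations."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.Tanh())
    return nn.Sequential(*layers)

actor = mlp([1, 32, 32, 1])                    # multi-layer policy mean network
log_std = torch.zeros(1, requires_grad=True)
critic = mlp([2, 32, 32, 1])                   # multi-layer Q-network on (s, a)
pi_opt = torch.optim.SGD(list(actor.parameters()) + [log_std], lr=3e-3)
q_opt = torch.optim.SGD(critic.parameters(), lr=1e-2)
gamma = 0.95

def env_step(s, a):
    """Toy continuous dynamics: staying near the origin is rewarded (an assumption)."""
    s_next = 0.9 * s + 0.1 * a + 0.05 * torch.randn(1)
    reward = -(s_next ** 2) - 0.01 * (a ** 2)
    return s_next, reward

s = torch.randn(1)  # one Markovian trajectory: no i.i.d. resets anywhere below
for t in range(5000):
    dist = torch.distributions.Normal(actor(s), log_std.exp())
    a = dist.sample()
    s_next, r = env_step(s, a)

    # Critic update: one semi-gradient TD(0) step on Q(s, a).
    with torch.no_grad():
        a_next = torch.distributions.Normal(actor(s_next), log_std.exp()).sample()
        target = r + gamma * critic(torch.cat([s_next, a_next]))
    q = critic(torch.cat([s, a]))
    q_loss = (q - target).pow(2).mean()
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    # Actor update: policy-gradient step driven by the critic's estimate.
    pi_loss = (-dist.log_prob(a) * q.detach()).mean()
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()

    s = s_next.detach()

# The last-iterate policy is what gets reported, not an average of iterates.
print("last-iterate mean action at s = 1.0:", actor(torch.ones(1)).item())
```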
Abstract: Actor-critic algorithms have shown remarkable success in solving state-of-the-art decision-making problems. However, despite their empirical effectiveness, their theoretical underpinnings remain relatively unexplored, especially under neural network parametrization. In this paper, we study a natural actor-critic algorithm that uses a neural network to represent the critic, with the aim of establishing sample complexity guarantees and a deeper understanding of its performance characteristics. To this end, we propose a Natural Actor-Critic algorithm with 2-Layer critic parametrization (NAC2L). Our approach estimates the $Q$-function in each iteration by solving a convex optimization problem. We establish that the proposed approach attains a sample complexity of $\tilde{\mathcal{O}}\left(\frac{1}{\epsilon^{4}(1-\gamma)^{4}}\right)$. In contrast, existing sample complexity results in the literature hold only for tabular or linear MDPs; our result holds for countable state spaces and does not require a linear or low-rank structure on the MDP.
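The sketch below conveys the NAC2L template on a toy finite MDP: in each iteration the $Q$-function is fit with a two-layer ReLU critic, and the policy is updated with a natural-policy-gradient-style step. To keep each critic fit convex with a closed-form solution, the first ReLU layer is frozen at random weights and only the output weights are refit; this is a simplification of the paper's convex reformulation, and the MDP, Monte Carlo targets, and constants are assumptions.

```python
# Illustrative sketch in the spirit of NAC2L on a toy finite MDP.  The critic is
# a two-layer ReLU network whose first layer is frozen at random weights, so the
# per-iteration Q-fit reduces to convex least squares in the output weights -- a
# simplification of the paper's convex reformulation, which trains both layers.
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, H = 6, 3, 0.9, 40
P = rng.dirichlet(np.ones(S), size=(S, A))      # transition kernel
R = rng.uniform(size=(S, A))                    # reward table
feat = rng.normal(size=(S, A, 8))               # raw (s, a) features
W1 = rng.normal(size=(8, 64))                   # frozen first ReLU layer
PHI = np.maximum(feat @ W1, 0.0)                # (S, A, width) ReLU features
logits = np.zeros((S, A))                       # tabular softmax policy

def policy(s):
    p = np.exp(logits[s] - logits[s].max())
    return p / p.sum()

def rollout():
    """Truncated Monte Carlo return for a random (s, a) start under the policy."""
    s, a = rng.integers(S), rng.integers(A)
    s0, a0, G, disc = s, a, 0.0, 1.0
    for _ in range(H):
        G += disc * R[s, a]
        disc *= gamma
        s = rng.choice(S, p=P[s, a])
        a = rng.choice(A, p=policy(s))
    return s0, a0, G

for it in range(30):
    # Critic: convex ridge regression for the output-layer weights.
    X, y = [], []
    for _ in range(400):
        s, a, G = rollout()
        X.append(PHI[s, a])
        y.append(G)
    X, y = np.array(X), np.array(y)
    w2 = np.linalg.solve(X.T @ X + 1e-2 * np.eye(64), X.T @ y)
    Q = PHI @ w2                                # estimated Q-values, shape (S, A)

    # Actor: natural policy gradient; for a tabular softmax policy this amounts
    # to a logit update by the Q-estimates (up to a per-state baseline).
    logits += 0.5 * Q

print("greedy actions of the learned policy:", Q.argmax(axis=1))
```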
Abstract: Deep Q-learning based algorithms have been applied successfully to many decision-making problems, yet their theoretical foundations are not as well understood. In this paper, we study Fitted Q-Iteration with a two-layer ReLU neural network parametrization and establish sample complexity guarantees for the algorithm. The approach estimates the $Q$-function in each iteration by solving a convex optimization problem. We show that this approach achieves a sample complexity of $\tilde{\mathcal{O}}(1/\epsilon^{2})$, which is order-optimal. This result holds for a countable state space and does not require any assumptions such as a linear or low-rank structure on the MDP.
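For intuition, the following sketch runs Fitted Q-Iteration with a two-layer ReLU critic on an offline dataset, regressing Bellman optimality targets onto ReLU features in each iteration. Only the output layer is refit, so each fitting step is a convex ridge regression with a closed-form solution; this is a simplification of the convex program analyzed here, and the toy MDP, uniform behavior policy, and constants are assumptions.

```python
# Hedged sketch of Fitted Q-Iteration with a two-layer ReLU critic on an offline
# dataset.  The first ReLU layer is frozen at random weights so that every
# fitting step is a convex ridge regression -- a simplification of the convex
# optimization problem analyzed in the paper.  All quantities are assumptions.
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma, N = 10, 4, 0.9, 5000
P = rng.dirichlet(np.ones(S), size=(S, A))     # true dynamics (unknown to the learner)
R = rng.uniform(size=(S, A))                   # reward table
feat = rng.normal(size=(S, A, 6))              # raw (s, a) features
W1 = rng.normal(size=(6, 32))                  # frozen first ReLU layer
PHI = np.maximum(feat @ W1, 0.0)               # (S, A, width) ReLU features

# Offline transitions collected by a uniform behavior policy.
s_idx = rng.integers(S, size=N)
a_idx = rng.integers(A, size=N)
sn_idx = np.array([rng.choice(S, p=P[s, a]) for s, a in zip(s_idx, a_idx)])
X, r_vec = PHI[s_idx, a_idx], R[s_idx, a_idx]

w2 = np.zeros(32)                              # output-layer weights of Q_k
for k in range(50):
    # Bellman optimality targets under the current iterate Q_k.
    y = r_vec + gamma * (PHI[sn_idx] @ w2).max(axis=1)
    # Convex ridge-regression fit of the next iterate Q_{k+1}.
    w2 = np.linalg.solve(X.T @ X + 1e-2 * np.eye(32), X.T @ y)

Q = PHI @ w2
print("greedy policy:", Q.argmax(axis=1))
```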