Erdem Bıyık

Actor-Free Continuous Control via Structurally Maximizable Q-Functions

Oct 21, 2025

Yigit Korkmaz, Urvi Bhuwania, Ayush Jain, Erdem Bıyık

Abstract: Value-based algorithms are a cornerstone of off-policy reinforcement learning due to their simplicity and training stability. However, their use has traditionally been restricted to discrete action spaces, as they rely on estimating Q-values for individual state-action pairs. In continuous action spaces, evaluating the Q-value over the entire action space becomes computationally infeasible. To address this, actor-critic methods are typically employed, where a critic is trained on off-policy data to estimate Q-values, and an actor is trained to maximize the critic's output. Despite their popularity, these methods often suffer from instability during training. In this work, we propose a purely value-based framework for continuous control that revisits structural maximization of Q-functions, introducing a set of key architectural and algorithmic choices to enable efficient and stable learning. We evaluate the proposed actor-free Q-learning approach on a range of standard simulation tasks, demonstrating performance and sample efficiency on par with state-of-the-art baselines, without the cost of learning a separate actor. In particular, in environments with constrained action spaces, where the value functions are typically non-smooth, our method with structural maximization outperforms traditional actor-critic methods with gradient-based maximization. We have released our code at https://github.com/USC-Lira/Q3C.

* 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
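
To make "structural maximization" concrete, here is a minimal sketch of one well-known way to build a Q-function whose maximizer can be read off without an actor: a wire-fitting style interpolator over a small set of control points. This is an illustrative assumption, not the released Q3C implementation; the function names, the `smoothing` and `eps` parameters, and the hard-coded control points (which a network would normally predict from the state) are all hypothetical.

```python
from typing import List, Sequence, Tuple

Action = Tuple[float, ...]


def interpolated_q(
    action: Sequence[float],
    control_actions: List[Action],
    control_values: List[float],
    smoothing: float = 1.0,
    eps: float = 1e-6,
) -> float:
    """Q(s, a) as a positively weighted average of control-point values.

    Because the result is a weighted average, it can never exceed the
    largest control-point value.
    """
    q_max = max(control_values)
    weights = []
    for a_i, q_i in zip(control_actions, control_values):
        sq_dist = sum((x - y) ** 2 for x, y in zip(action, a_i))
        # Nearby control points and high-valued control points get more weight.
        weights.append(1.0 / (sq_dist + smoothing * (q_max - q_i) + eps))
    total = sum(weights)
    return sum(w * q for w, q in zip(weights, control_values)) / total


def greedy_action(
    control_actions: List[Action], control_values: List[float]
) -> Tuple[Action, float]:
    """The (approximate) maximizer is the control point with the largest value."""
    best = max(range(len(control_values)), key=lambda i: control_values[i])
    return control_actions[best], control_values[best]
```

In this sketch the greedy action is a lookup over control points rather than the output of a separately trained actor, and the interpolated value at that point approaches the maximum as `eps` shrinks.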

HAND Me the Data: Fast Robot Adaptation via Hand Path Retrieval

May 28, 2025

Matthew Hong, Anthony Liang, Kevin Kim, Harshitha Rajaprakash, Jesse Thomason, Erdem Bıyık, Jesse Zhang

Abstract: We hand the community HAND, a simple and time-efficient method for teaching robots new manipulation tasks through human hand demonstrations. Instead of relying on task-specific robot demonstrations collected via teleoperation, HAND uses easy-to-provide hand demonstrations to retrieve relevant behaviors from task-agnostic robot play data. Using a visual tracking pipeline, HAND extracts the motion of the human hand from the hand demonstration and retrieves robot sub-trajectories in two stages: first filtering by visual similarity, then retrieving trajectories with similar behaviors to the hand. Fine-tuning a policy on the retrieved data enables real-time learning of tasks in under four minutes, without requiring calibrated cameras or detailed hand pose estimation. Experiments also show that HAND outperforms retrieval baselines by over 2x in average task success rates on real robots. Videos can be found at our project website: https://liralab.usc.edu/handretrieval/.
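
The two-stage retrieval described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the clip dictionary layout, the cosine-similarity filter, the mean pointwise path distance, and the `min_visual_sim` and `top_k` parameters are hypothetical stand-ins, not the features or metrics used by HAND.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def path_distance(p: Sequence[Point], q: Sequence[Point]) -> float:
    """Mean pointwise distance between two equal-length 2D paths (placeholder metric)."""
    return sum(math.dist(a, b) for a, b in zip(p, q)) / len(p)


def retrieve(
    hand_feature: Sequence[float],
    hand_path: Sequence[Point],
    clips: List[Dict],
    min_visual_sim: float = 0.8,
    top_k: int = 2,
) -> List[Dict]:
    # Stage 1: keep only robot clips that look similar to the hand demonstration.
    candidates = [
        c for c in clips if cosine(hand_feature, c["feature"]) >= min_visual_sim
    ]
    # Stage 2: rank the survivors by how closely their motion follows the hand path.
    candidates.sort(key=lambda c: path_distance(hand_path, c["path"]))
    return candidates[:top_k]
```

The retrieved clips would then serve as fine-tuning data for the policy; that step is outside this sketch.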

IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models

Mar 13, 2025

Yiyang Ling, Karan Owalekar, Oluwatobiloba Adesanya, Erdem Bıyık, Daniel Seita

Abstract: Motion planning involves determining a sequence of robot configurations to reach a desired pose, subject to movement and safety constraints. Traditional motion planning finds collision-free paths, but this is overly restrictive in clutter, where it may not be possible for a robot to accomplish a task without contact. In addition, contacts range from relatively benign (e.g., brushing a soft pillow) to more dangerous (e.g., toppling a glass vase). Due to this diversity, it is difficult to characterize which contacts may be acceptable or unacceptable. In this paper, we propose IMPACT, a novel motion planning framework that uses Vision-Language Models (VLMs) to infer environment semantics, identifying which parts of the environment can best tolerate contact based on object properties and locations. Our approach uses the VLM's outputs to produce a dense 3D "cost map" that encodes contact tolerances and seamlessly integrates with standard motion planners. We perform experiments using 20 simulation and 10 real-world scenes and assess using task success rate, object displacements, and feedback from human evaluators. Our results over 3620 simulation and 200 real-world trials suggest that IMPACT enables efficient contact-rich motion planning in cluttered settings while outperforming alternative methods and ablations. Supplementary material is available at https://impact-planning.github.io/.
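
As a rough illustration of how a 3D cost map could feed into a planner's objective, the sketch below scores a candidate path by its length plus the contact cost of the voxels it passes through. Everything here is an assumption made for illustration: the sparse dictionary representation, the `voxel_size` and `contact_weight` parameters, and the example costs are hypothetical and are not taken from IMPACT.

```python
import math
from typing import Dict, Sequence, Tuple

Point3 = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def voxel_of(point: Point3, voxel_size: float) -> Voxel:
    """Index of the voxel that contains a 3D point."""
    return tuple(int(math.floor(c / voxel_size)) for c in point)


def path_cost(
    path: Sequence[Point3],
    cost_map: Dict[Voxel, float],
    voxel_size: float = 0.1,
    contact_weight: float = 1.0,
    default_cost: float = 0.0,
) -> float:
    """Path length plus weighted contact cost of the voxels visited by the waypoints."""
    length = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
    contact = sum(cost_map.get(voxel_of(p, voxel_size), default_cost) for p in path)
    return length + contact_weight * contact
```

With a low cost on a voxel occupied by a soft object and a high cost on a voxel occupied by a fragile one, a planner minimizing this objective would prefer brushing the former over approaching the latter.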

Multi-Agent Inverse Q-Learning from Demonstrations

Mar 06, 2025

Nathaniel Haynam, Adam Khoja, Dhruv Kumar, Vivek Myers, Erdem Bıyık

Abstract: When reward functions are hand-designed, deep reinforcement learning algorithms often suffer from reward misspecification, causing them to learn suboptimal policies in terms of the intended task objectives. In the single-agent case, inverse reinforcement learning (IRL) techniques attempt to address this issue by inferring the reward function from expert demonstrations. However, in multi-agent problems, misalignment between the learned and true objectives is exacerbated due to increased environment non-stationarity and variance that scales with multiple agents. As such, in multi-agent general-sum games, multi-agent IRL algorithms have difficulty balancing cooperative and competitive objectives. To address these issues, we propose Multi-Agent Marginal Q-Learning from Demonstrations (MAMQL), a novel sample-efficient framework for multi-agent IRL. For each agent, MAMQL learns a critic marginalized over the other agents' policies, allowing for a well-motivated use of Boltzmann policies in the multi-agent context. We identify a connection between optimal marginalized critics and single-agent soft-Q IRL, allowing us to apply a direct, simple optimization criterion from the single-agent domain. Across our experiments on three different simulated domains, MAMQL significantly outperforms previous multi-agent methods in average reward, sample efficiency, and reward recovery, often by more than 2-5x. We make our code available at https://sites.google.com/view/mamql .

* 8 pages, 4 figures, 2 tables. Published at the International Conference on Robotics and Automation (ICRA) 2025
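
The two ingredients named in this abstract, a critic marginalized over the other agents' policies and a Boltzmann policy over that critic, can be illustrated in a single-state, two-agent tabular setting. This is a minimal sketch under those simplifying assumptions, not the MAMQL training procedure; the function names and the `temperature` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def marginal_q(
    joint_q: Sequence[Sequence[float]], other_policy: Sequence[float]
) -> List[float]:
    """Average the joint critic over the other agent's action distribution.

    joint_q[a_i][a_other] is the value of agent i taking a_i while the
    other agent takes a_other; other_policy[a_other] is its probability.
    """
    return [sum(p * q for p, q in zip(other_policy, row)) for row in joint_q]


def boltzmann(q_values: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax over Q-values, shifted by the maximum for numerical stability."""
    q_top = max(q_values)
    exps = [math.exp((q - q_top) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]
```

Composing the two, `boltzmann(marginal_q(joint_q, other_policy))` gives agent i's action distribution in this toy setting.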

RAILGUN: A Unified Convolutional Policy for Multi-Agent Path Finding Across Different Environments and Tasks

Mar 04, 2025

Yimin Tang, Xiao Xiong, Jingyi Xi, Jiaoyang Li, Erdem Bıyık, Sven Koenig

Abstract: Multi-Agent Path Finding (MAPF), which focuses on finding collision-free paths for multiple robots, is crucial for applications ranging from aerial swarms to warehouse automation. Solving MAPF is NP-hard, so learning-based approaches for MAPF have gained attention, particularly those leveraging deep neural networks. Nonetheless, despite the community's continued efforts, all learning-based MAPF planners still rely on decentralized planning due to variability in the number of agents and map sizes. We have developed the first centralized learning-based policy for the MAPF problem, called RAILGUN. RAILGUN is not an agent-based policy but a map-based policy. By leveraging a CNN-based architecture, RAILGUN can generalize across different maps and handle any number of agents. We collect trajectories from rule-based methods to train our model in a supervised way. In experiments, RAILGUN outperforms most baseline methods and demonstrates great zero-shot generalization capabilities on various tasks, maps, and agent numbers that were not seen in the training dataset.

* 7 pages

MILE: Model-based Intervention Learning

Feb 19, 2025

Yigit Korkmaz, Erdem Bıyık

Abstract: Imitation learning techniques have been shown to be highly effective in real-world control scenarios, such as robotics. However, these approaches not only suffer from compounding error issues but also require human experts to provide complete trajectories. Although there exist interactive methods where an expert oversees the robot and intervenes if needed, these extensions usually only utilize the data collected during intervention periods and ignore the feedback signal hidden in non-intervention timesteps. In this work, we create a model to formulate how the interventions occur in such cases, and show that it is possible to learn a policy with just a handful of expert interventions. Our key insight is that it is possible to get crucial information about the quality of the current state and the optimality of the chosen action from expert feedback, regardless of the presence or the absence of intervention. We evaluate our method on various discrete and continuous simulation environments, a real-world robotic manipulation task, as well as a human subject study. Videos and the code can be found at https://liralab.usc.edu/mile .

* International Conference on Robotics and Automation (ICRA)

NaVILA: Legged Robot Vision-Language-Action Model for Navigation

Dec 05, 2024

An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin, Sifei Liu, Xiaolong Wang

Abstract: This paper proposes to solve the problem of Vision-and-Language Navigation with legged robots, which not only provides a flexible way for humans to give commands but also allows the robot to navigate through more challenging and cluttered scenes. However, it is non-trivial to translate human language instructions all the way to low-level leg joint actions. We propose NaVILA, a 2-level framework that unifies a Vision-Language-Action model (VLA) with locomotion skills. Instead of directly predicting low-level actions from the VLA, NaVILA first generates mid-level actions with spatial information in the form of language (e.g., "moving forward 75cm"), which serve as input for a visual locomotion RL policy for execution. NaVILA substantially improves on previous approaches on existing benchmarks. The same advantages are demonstrated in our newly developed benchmarks with IsaacLab, featuring more realistic scenes, low-level controls, and real-world robot experiments. We show more results at https://navila-bot.github.io/

* Website: https://navila-bot.github.io/
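
To illustrate what a language-form mid-level action such as "moving forward 75cm" might look like once it is turned into a structured command, here is a small parsing sketch. It is purely hypothetical: the abstract does not describe how NaVILA converts these phrases, and the regular expression, the supported directions and units, and the function name are assumptions made for this example.

```python
import re
from typing import Optional, Tuple

# Hypothetical phrase format: "<move|moving> <forward|backward> <number><cm|m>".
_PATTERN = re.compile(
    r"(?:move|moving)\s+(forward|backward)\s+(\d+(?:\.\d+)?)\s*(cm|m)\b",
    re.IGNORECASE,
)


def parse_mid_level_action(text: str) -> Optional[Tuple[str, float]]:
    """Return (direction, distance in metres), or None if the phrase does not match."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    direction = match.group(1).lower()
    amount = float(match.group(2))
    unit = match.group(3).lower()
    metres = amount / 100.0 if unit == "cm" else amount
    return direction, metres
```

A downstream locomotion policy could then consume the direction and distance as a target; that interface is likewise an assumption of this sketch.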

Accurate and Data-Efficient Toxicity Prediction when Annotators Disagree

Oct 16, 2024

Harbani Jaggi, Kashyap Murali, Eve Fleisig, Erdem Bıyık

Abstract: When annotators disagree, predicting the labels given by individual annotators can capture nuances overlooked by traditional label aggregation. We introduce three approaches to predicting individual annotator ratings on the toxicity of text by incorporating individual annotator-specific information: a neural collaborative filtering (NCF) approach, an in-context learning (ICL) approach, and an intermediate embedding-based architecture. We also study the utility of demographic information for rating prediction. NCF showed limited utility; however, integrating annotator history, demographics, and survey information permits both the embedding-based architecture and ICL to substantially improve prediction accuracy, with the embedding-based architecture outperforming the other methods. We also find that, if demographics are predicted from survey information, using these imputed demographics as features performs comparably to using true demographic data. This suggests that demographics may not provide substantial information for modeling ratings beyond what is captured in survey responses. Our findings raise considerations about the relative utility of different types of annotator information and provide new approaches for modeling annotators in subjective NLP tasks.

Mitigating Suboptimality of Deterministic Policy Gradients in Complex Q-functions

Oct 15, 2024

Ayush Jain, Norio Kosaka, Xinhu Li, Kyung-Min Kim, Erdem Bıyık, Joseph J. Lim

Abstract: In reinforcement learning, off-policy actor-critic approaches like DDPG and TD3 are based on the deterministic policy gradient. Herein, the Q-function is trained from off-policy environment data and the actor (policy) is trained to maximize the Q-function via gradient ascent. We observe that in complex tasks like dexterous manipulation and restricted locomotion, the Q-value is a complex function of action, having several local optima or discontinuities. This poses a challenge for gradient ascent to traverse and makes the actor prone to getting stuck at local optima. To address this, we introduce a new actor architecture that combines two simple insights: (i) use multiple actors and evaluate the Q-value maximizing action, and (ii) learn surrogates to the Q-function that are simpler to optimize with gradient-based methods. We evaluate tasks such as restricted locomotion, dexterous manipulation, and large discrete-action space recommender systems and show that our actor finds optimal actions more frequently and outperforms alternative actor architectures.
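
Insight (i) from this abstract, proposing actions with several actors and keeping the one the critic scores highest, can be sketched in a few lines. This is a toy illustration only: the callable signatures and the function name are assumptions, and the surrogate Q-functions of insight (ii) are not modeled here.

```python
from typing import Callable, Sequence

State = float
Action = float


def select_action(
    state: State,
    actors: Sequence[Callable[[State], Action]],
    q_function: Callable[[State, Action], float],
) -> Action:
    """Let every actor propose an action, then keep the proposal with the highest Q-value."""
    candidates = [actor(state) for actor in actors]
    return max(candidates, key=lambda action: q_function(state, action))
```

If the Q-function has two peaks and each actor has converged to a different one, the selection step returns the action at the higher peak, whereas a single actor could remain at the lower one.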

Trajectory Improvement and Reward Learning from Comparative Language Feedback

Oct 08, 2024

Zhaojing Yang, Miru Jun, Jeremy Tien, Stuart J. Russell, Anca Dragan, Erdem Bıyık

Abstract: Learning from human feedback has gained traction in fields like robotics and natural language processing in recent years. While prior works mostly rely on human feedback in the form of comparisons, language is a preferable modality that provides more informative insights into user preferences. In this work, we aim to incorporate comparative language feedback to iteratively improve robot trajectories and to learn reward functions that encode human preferences. To achieve this goal, we learn a shared latent space that integrates trajectory data and language feedback, and subsequently leverage the learned latent space to improve trajectories and learn human preferences. To the best of our knowledge, we are the first to incorporate comparative language feedback into reward learning. Our simulation experiments demonstrate the effectiveness of the learned latent space and the success of our learning algorithms. We also conduct human subject studies that show our reward learning algorithm achieves a 23.9% higher subjective score on average and is 11.3% more time-efficient compared to preference-based reward learning, underscoring the superior performance of our method. Our website is at https://liralab.usc.edu/comparative-language-feedback/

* 8th Annual Conference of Robot Learning (2024)
