Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Sheila A. McIlraith

Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers

Feb 27, 2025

Shalev Lifshitz, Sheila A. McIlraith, Yilun Du

Abstract:By utilizing more computational resources at test-time, large language models (LLMs) can improve without additional training. One common strategy uses verifiers to evaluate candidate outputs. In this work, we propose a novel scaling dimension for test-time compute: scaling the number of verifiers. We introduce Multi-Agent Verification (MAV) as a test-time compute paradigm that combines multiple verifiers to improve performance. We propose using Aspect Verifiers (AVs), off-the-shelf LLMs prompted to verify different aspects of outputs, as one possible choice for the verifiers in a MAV system. AVs are a convenient building block for MAV since they can be easily combined without additional training. Moreover, we introduce BoN-MAV, a simple multi-agent verification algorithm that combines best-of-n sampling with multiple verifiers. BoN-MAV demonstrates stronger scaling patterns than self-consistency and reward model verification, and we demonstrate both weak-to-strong generalization, where combining weak verifiers improves even stronger LLMs, and self-improvement, where the same base model is used to both generate and verify outputs. Our results establish scaling the number of verifiers as a promising new dimension for improving language model performance at test-time.

Via

Access Paper or Ask Questions

Pluralistic Alignment Over Time

Nov 16, 2024

Toryn Q. Klassen, Parand A. Alamdari, Sheila A. McIlraith

Figure 1 for Pluralistic Alignment Over Time

Figure 2 for Pluralistic Alignment Over Time

Abstract:If an AI system makes decisions over time, how should we evaluate how aligned it is with a group of stakeholders (who may have conflicting values and preferences)? In this position paper, we advocate for consideration of temporal aspects including stakeholders' changing levels of satisfaction and their possibly temporally extended preferences. We suggest how a recent approach to evaluating fairness over time could be applied to a new form of pluralistic alignment: temporal pluralism, where the AI system reflects different stakeholders' values at different times.

* Pluralistic Alignment Workshop at NeurIPS 2024

Via

Access Paper or Ask Questions

Being Considerate as a Pathway Towards Pluralistic Alignment for Agentic AI

Nov 15, 2024

Parand A. Alamdari, Toryn Q. Klassen, Rodrigo Toro Icarte, Sheila A. McIlraith

Abstract:Pluralistic alignment is concerned with ensuring that an AI system's objectives and behaviors are in harmony with the diversity of human values and perspectives. In this paper we study the notion of pluralistic alignment in the context of agentic AI, and in particular in the context of an agent that is trying to learn a policy in a manner that is mindful of the values and perspective of others in the environment. To this end, we show how being considerate of the future wellbeing and agency of other (human) agents can promote a form of pluralistic alignment.

* Pluralistic Alignment Workshop at NeurIPS 2024

Via

Access Paper or Ask Questions

Reward Machines for Deep RL in Noisy and Uncertain Environments

May 31, 2024

Andrew C. Li, Zizhao Chen, Toryn Q. Klassen, Pashootan Vaezipoor, Rodrigo Toro Icarte, Sheila A. McIlraith

Abstract:Reward Machines provide an automata-inspired structure for specifying instructions, safety constraints, and other temporally extended reward-worthy behaviour. By exposing complex reward function structure, they enable counterfactual learning updates that have resulted in impressive sample efficiency gains. While Reward Machines have been employed in both tabular and deep RL settings, they have typically relied on a ground-truth interpretation of the domain-specific vocabulary that form the building blocks of the reward function. Such ground-truth interpretations can be elusive in many real-world settings, due in part to partial observability or noisy sensing. In this paper, we explore the use of Reward Machines for Deep RL in noisy and uncertain environments. We characterize this problem as a POMDP and propose a suite of RL algorithms that leverage task structure under uncertain interpretation of domain-specific vocabulary. Theoretical analysis exposes pitfalls in naive approaches to this problem, while experimental results show that our algorithms successfully leverage task structure to improve performance under noisy interpretations of the vocabulary. Our results provide a general framework for exploiting Reward Machines in partially observable environments.

Via

Access Paper or Ask Questions

PRP Rebooted: Advancing the State of the Art in FOND Planning

Dec 20, 2023

Christian Muise, Sheila A. McIlraith, J. Christopher Beck

Abstract:Fully Observable Non-Deterministic (FOND) planning is a variant of classical symbolic planning in which actions are nondeterministic, with an action's outcome known only upon execution. It is a popular planning paradigm with applications ranging from robot planning to dialogue-agent design and reactive synthesis. Over the last 20 years, a number of approaches to FOND planning have emerged. In this work, we establish a new state of the art, following in the footsteps of some of the most powerful FOND planners to date. Our planner, PR2, decisively outperforms the four leading FOND planners, at times by a large margin, in 17 of 18 domains that represent a comprehensive benchmark suite. Ablation studies demonstrate the impact of various techniques we introduce, with the largest improvement coming from our novel FOND-aware heuristic.

* 13 pages, 4 figures, AAAI conference paper Update: Fixed abstract and typos

Via

Access Paper or Ask Questions

Remembering to Be Fair: On Non-Markovian Fairness in Sequential DecisionMaking

Dec 08, 2023

Parand A. Alamdari, Toryn Q. Klassen, Elliot Creager, Sheila A. McIlraith

Figure 1 for Remembering to Be Fair: On Non-Markovian Fairness in Sequential DecisionMaking

Figure 2 for Remembering to Be Fair: On Non-Markovian Fairness in Sequential DecisionMaking

Figure 3 for Remembering to Be Fair: On Non-Markovian Fairness in Sequential DecisionMaking

Figure 4 for Remembering to Be Fair: On Non-Markovian Fairness in Sequential DecisionMaking

Abstract:Fair decision making has largely been studied with respect to a single decision. In this paper we investigate the notion of fairness in the context of sequential decision making where multiple stakeholders can be affected by the outcomes of decisions, and where decision making may be informed by additional constraints and criteria beyond the requirement of fairness. In this setting, we observe that fairness often depends on the history of the sequential decision-making process and not just on the current state. To advance our understanding of this class of fairness problems, we define the notion of non-Markovian fairness in the context of sequential decision making. We identify properties of non-Markovian fairness, including notions of long-term, anytime, periodic, and bounded fairness. We further explore the interplay between non-Markovian fairness and memory, and how this can support construction of fair policies in sequential decision-making settings.

* 9 pages

Via

Access Paper or Ask Questions

Learning Symbolic Representations for Reinforcement Learning of Non-Markovian Behavior

Jan 08, 2023

Phillip J. K. Christoffersen, Andrew C. Li, Rodrigo Toro Icarte, Sheila A. McIlraith

Figure 1 for Learning Symbolic Representations for Reinforcement Learning of Non-Markovian Behavior

Figure 2 for Learning Symbolic Representations for Reinforcement Learning of Non-Markovian Behavior

Abstract:Many real-world reinforcement learning (RL) problems necessitate learning complex, temporally extended behavior that may only receive reward signal when the behavior is completed. If the reward-worthy behavior is known, it can be specified in terms of a non-Markovian reward function - a function that depends on aspects of the state-action history, rather than just the current state and action. Such reward functions yield sparse rewards, necessitating an inordinate number of experiences to find a policy that captures the reward-worthy pattern of behavior. Recent work has leveraged Knowledge Representation (KR) to provide a symbolic abstraction of aspects of the state that summarize reward-relevant properties of the state-action history and support learning a Markovian decomposition of the problem in terms of an automaton over the KR. Providing such a decomposition has been shown to vastly improve learning rates, especially when coupled with algorithms that exploit automaton structure. Nevertheless, such techniques rely on a priori knowledge of the KR. In this work, we explore how to automatically discover useful state abstractions that support learning automata over the state-action history. The result is an end-to-end algorithm that can learn optimal policies with significantly fewer environment samples than state-of-the-art RL on simple non-Markovian domains.

* 7 pages, 2 figures, presented at KR2ML workshop at NeurIPS 2020

Via

Access Paper or Ask Questions

Noisy Symbolic Abstractions for Deep RL: A case study with Reward Machines

Nov 23, 2022

Andrew C. Li, Zizhao Chen, Pashootan Vaezipoor, Toryn Q. Klassen, Rodrigo Toro Icarte, Sheila A. McIlraith

Figure 1 for Noisy Symbolic Abstractions for Deep RL: A case study with Reward Machines

Figure 2 for Noisy Symbolic Abstractions for Deep RL: A case study with Reward Machines

Figure 3 for Noisy Symbolic Abstractions for Deep RL: A case study with Reward Machines

Figure 4 for Noisy Symbolic Abstractions for Deep RL: A case study with Reward Machines

Abstract:Natural and formal languages provide an effective mechanism for humans to specify instructions and reward functions. We investigate how to generate policies via RL when reward functions are specified in a symbolic language captured by Reward Machines, an increasingly popular automaton-inspired structure. We are interested in the case where the mapping of environment state to a symbolic (here, Reward Machine) vocabulary -- commonly known as the labelling function -- is uncertain from the perspective of the agent. We formulate the problem of policy learning in Reward Machines with noisy symbolic abstractions as a special class of POMDP optimization problem, and investigate several methods to address the problem, building on existing and new techniques, the latter focused on predicting Reward Machine state, rather than on grounding of individual symbols. We analyze these methods and evaluate them experimentally under varying degrees of uncertainty in the correct interpretation of the symbolic vocabulary. We verify the strength of our approach and the limitation of existing methods via an empirical investigation on both illustrative, toy domains and partially observable, deep RL domains.

* NeurIPS Deep Reinforcement Learning Workshop 2022

Via

Access Paper or Ask Questions

Learning to Follow Instructions in Text-Based Games

Nov 08, 2022

Mathieu Tuli, Andrew C. Li, Pashootan Vaezipoor, Toryn Q. Klassen, Scott Sanner, Sheila A. McIlraith

Figure 1 for Learning to Follow Instructions in Text-Based Games

Figure 2 for Learning to Follow Instructions in Text-Based Games

Figure 3 for Learning to Follow Instructions in Text-Based Games

Figure 4 for Learning to Follow Instructions in Text-Based Games

Abstract:Text-based games present a unique class of sequential decision making problem in which agents interact with a partially observable, simulated environment via actions and observations conveyed through natural language. Such observations typically include instructions that, in a reinforcement learning (RL) setting, can directly or indirectly guide a player towards completing reward-worthy tasks. In this work, we study the ability of RL agents to follow such instructions. We conduct experiments that show that the performance of state-of-the-art text-based game agents is largely unaffected by the presence or absence of such instructions, and that these agents are typically unable to execute tasks to completion. To further study and address the task of instruction following, we equip RL agents with an internal structured representation of natural language instructions in the form of Linear Temporal Logic (LTL), a formal language that is increasingly used for temporally extended reward specification in RL. Our framework both supports and highlights the benefit of understanding the temporal semantics of instructions and in measuring progress towards achievement of such a temporally extended behaviour. Experiments with 500+ games in TextWorld demonstrate the superior performance of our approach.

* NeurIPS 2022

Via

Access Paper or Ask Questions

Challenges to Solving Combinatorially Hard Long-Horizon Deep RL Tasks

Jun 03, 2022

Andrew C. Li, Pashootan Vaezipoor, Rodrigo Toro Icarte, Sheila A. McIlraith

Figure 1 for Challenges to Solving Combinatorially Hard Long-Horizon Deep RL Tasks

Figure 2 for Challenges to Solving Combinatorially Hard Long-Horizon Deep RL Tasks

Figure 3 for Challenges to Solving Combinatorially Hard Long-Horizon Deep RL Tasks

Figure 4 for Challenges to Solving Combinatorially Hard Long-Horizon Deep RL Tasks

Abstract:Deep reinforcement learning has shown promise in discrete domains requiring complex reasoning, including games such as Chess, Go, and Hanabi. However, this type of reasoning is less often observed in long-horizon, continuous domains with high-dimensional observations, where instead RL research has predominantly focused on problems with simple high-level structure (e.g. opening a drawer or moving a robot as fast as possible). Inspired by combinatorially hard optimization problems, we propose a set of robotics tasks which admit many distinct solutions at the high-level, but require reasoning about states and rewards thousands of steps into the future for the best performance. Critically, while RL has traditionally suffered on complex, long-horizon tasks due to sparse rewards, our tasks are carefully designed to be solvable without specialized exploration. Nevertheless, our investigation finds that standard RL methods often neglect long-term effects due to discounting, while general-purpose hierarchical RL approaches struggle unless additional abstract domain knowledge can be exploited.

Via

Access Paper or Ask Questions