Abstract: While machine learning methods excel at pattern recognition, they struggle to perform complex reasoning tasks in a scalable, algorithmic manner. Recent Deep Thinking methods show promise in learning algorithms that extrapolate: the algorithm is learned in smaller environments and executed in larger ones. However, these works are limited to symmetrical tasks, where the input and output dimensionalities are the same. To address this gap, we propose NeuralThink, a new recurrent architecture that consistently extrapolates to both symmetrical and asymmetrical tasks, where the input and output dimensionalities differ. We also contribute a novel benchmark of asymmetrical tasks for extrapolation. We show that NeuralThink consistently outperforms prior state-of-the-art Deep Thinking architectures in terms of stable extrapolation to large observations from smaller training sizes.
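To illustrate the extrapolation principle shared by deep-thinking-style models, the sketch below shows a weight-tied recurrent block applied for more iterations on larger inputs at test time. The module names, layer sizes, and the purely symmetric (image-to-image) setting are illustrative assumptions, not the actual NeuralThink architecture.

```python
import torch
import torch.nn as nn

class RecurrentReasoner(nn.Module):
    """Illustrative deep-thinking-style model: a small weight-tied
    convolutional block applied for a variable number of iterations."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.encode = nn.Conv2d(1, channels, 3, padding=1)
        self.recur = nn.Sequential(  # same weights reused at every step
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.decode = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, x: torch.Tensor, iterations: int) -> torch.Tensor:
        h = self.encode(x)
        for _ in range(iterations):  # larger inputs get more iterations
            h = self.recur(h)
        return self.decode(h)

# Train on small inputs with few iterations, test on larger inputs with more.
model = RecurrentReasoner()
out_small = model(torch.randn(8, 1, 16, 16), iterations=10)
out_large = model(torch.randn(8, 1, 64, 64), iterations=40)  # extrapolation
```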
Abstract: This paper introduces a formal definition of the setting of ad hoc teamwork under partial observability and proposes a model-based approach grounded in first principles, which relies only on prior knowledge and partial observations of the environment to perform ad hoc teamwork. We make three distinct assumptions that set it apart from previous works, namely: i) the state of the environment is always partially observable, ii) the actions of the teammates are always unavailable to the ad hoc agent, and iii) the ad hoc agent has no access to a reward signal that could be used to learn the task from scratch. Our results in 70 POMDPs from 11 domains show that our approach is not only effective in assisting unknown teammates in solving unknown tasks, but also robust in scaling to more challenging problems.
Abstract: We study the convergence of $Q$-learning with linear function approximation. Our key contribution is the introduction of a novel multi-Bellman operator that extends the traditional Bellman operator. By exploring the properties of this operator, we identify conditions under which the projected multi-Bellman operator becomes contractive, providing improved fixed-point guarantees compared to the Bellman operator. To leverage these insights, we propose the multi $Q$-learning algorithm with linear function approximation. We demonstrate that this algorithm converges to the fixed point of the projected multi-Bellman operator, yielding solutions of arbitrary accuracy. Finally, we validate our approach by applying it to well-known environments, showcasing the effectiveness and applicability of our findings.
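As a sketch of the construction, assuming the multi-Bellman operator is the $m$-fold composition of the standard Bellman operator, with $\Phi$ the feature matrix and $\Pi$ the projection onto its column span:

$$(\mathcal{T} q)(s,a) = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) \max_{a'} q(s',a'), \qquad \mathcal{T}^m = \underbrace{\mathcal{T} \circ \cdots \circ \mathcal{T}}_{m \text{ times}},$$

$$\Phi \theta^\star = \Pi\, \mathcal{T}^m (\Phi \theta^\star),$$

so that, whenever $\Pi \mathcal{T}^m$ is contractive, the fixed point $\theta^\star$ is unique and can be approached iteratively.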
Abstract: We study the problem of teaching via demonstrations in sequential decision-making tasks. In particular, we focus on the setting in which the teacher has no access to the learner's model and policy, and the feedback from the learner is limited to trajectories that start from states selected by the teacher. The need to select the starting states and infer the learner's policy creates an opportunity for the teacher to use methods from inverse reinforcement learning and active learning. In this work, we formalize the teaching process with limited feedback and propose an algorithm that solves this teaching problem. The algorithm uses a modified version of the active value-at-risk method to select the starting states, a modified maximum causal entropy algorithm to infer the policy, and the difficulty score ratio method to choose the teaching demonstrations. We test the algorithm in a synthetic car-driving environment and conclude that it is an effective solution when the learner's feedback is limited.
Abstract: This work proposes a novel model-free Reinforcement Learning (RL) agent that learns to complete an unknown task while having access to only part of the input observation. We take inspiration from the concepts of visual attention and active perception, characteristic of humans, and apply them to our agent, creating a hard attention mechanism. In this mechanism, the model first decides which region of the input image to look at, and only then gains access to the pixels of that region. Current RL agents do not follow this principle, and, to our knowledge, such mechanisms have not been applied for the purpose pursued in this work. In our architecture, we adapt an existing model, the recurrent attention model (RAM), and combine it with the proximal policy optimization (PPO) algorithm. We investigate whether a model with these characteristics can achieve performance comparable to state-of-the-art model-free RL agents that access the full input observation. This analysis is conducted in two Atari games, Pong and Space Invaders, which have a discrete action space, and in CarRacing, which has a continuous action space. Besides assessing its performance, we also analyze the movement of our model's attention and compare it with an example of human behavior. Even with this visual limitation, we show that our model matches the performance of PPO+LSTM in two of the three games tested.
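As a minimal sketch of the hard attention step, assuming the attention location is produced by a separate policy head (the glimpse size and frame dimensions below are illustrative):

```python
import numpy as np

def glimpse(frame: np.ndarray, center: tuple, size: int = 24) -> np.ndarray:
    """Return a size x size crop of the frame around `center`; the agent
    only ever observes this patch, never the full frame (hard attention)."""
    h, w = frame.shape[:2]
    half = size // 2
    y = int(np.clip(center[0], half, h - half))
    x = int(np.clip(center[1], half, w - half))
    return frame[y - half:y + half, x - half:x + half]

# Illustrative step: the model first picks where to look, then acts on the patch.
frame = np.random.rand(210, 160)      # Atari-sized grayscale frame
location = (105, 80)                  # would come from the attention policy
patch = glimpse(frame, location)      # only this patch reaches the agent
assert patch.shape == (24, 24)
```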
Abstract: We introduce hybrid execution in multi-agent reinforcement learning (MARL), a new paradigm in which agents aim to successfully perform cooperative tasks with any communication level at execution time by taking advantage of information sharing among the agents. Under hybrid execution, the communication level can range from a setting in which no communication is allowed between agents (fully decentralized) to a setting featuring full communication (fully centralized). To formalize our setting, we define a new class of multi-agent partially observable Markov decision processes (POMDPs) that we name hybrid-POMDPs, which explicitly model a communication process between the agents. We contribute MARO, an approach that combines an autoregressive predictive model, used to estimate missing agents' observations, with a dropout-based RL training scheme that simulates different communication levels during the centralized training phase. We evaluate MARO on standard scenarios and on extensions of previous benchmarks tailored to emphasize the negative impact of partial observability in MARL. Experimental results show that our method consistently outperforms baselines, allowing agents to act with faulty communication while successfully exploiting shared information.
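The dropout idea can be sketched as follows; the masking probability, observation shapes, and zero-filling of missing messages are illustrative assumptions rather than the exact MARO training scheme:

```python
import numpy as np

def mask_communication(joint_obs: np.ndarray, agent_idx: int, p_drop: float) -> np.ndarray:
    """Simulate an arbitrary communication level during centralized training:
    each teammate's observation is independently dropped with probability p_drop."""
    masked = joint_obs.copy()
    for j in range(joint_obs.shape[0]):
        if j != agent_idx and np.random.rand() < p_drop:
            masked[j] = 0.0  # missing message; a predictive model could fill it in
    return masked

# Example: 3 agents with 8-dimensional observations, agent 0 acting under unreliable comms.
joint_obs = np.random.rand(3, 8)
obs_seen_by_agent0 = mask_communication(joint_obs, agent_idx=0, p_drop=0.5)
```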
Abstract: In this paper, we investigate the notion of legibility in sequential decision tasks under uncertainty. Previous works that extend legibility to scenarios beyond robot motion either focus on deterministic settings or are computationally too expensive. Our proposed approach, dubbed PoL-MDP, is able to handle uncertainty while remaining computationally tractable. We establish the advantages of our approach over state-of-the-art approaches in several simulated scenarios of varying complexity. We also showcase the use of our legible policies as demonstrations for an inverse reinforcement learning agent, establishing their superiority over the commonly used demonstrations based on the optimal policy. Finally, we assess the legibility of our computed policies through a user study in which people are asked to infer the goal of a mobile robot following a legible policy by observing its actions.
Abstract: Learning representations of multimodal data that are both informative and robust to missing modalities at test time remains a challenging problem, due to the inherent heterogeneity of data obtained from different channels. To address it, we present Geometric Multimodal Contrastive (GMC) representation learning, a novel method comprising two main components: i) a two-level architecture consisting of modality-specific base encoders, which process an arbitrary number of modalities into intermediate representations of fixed dimensionality, and a shared projection head, which maps the intermediate representations to a latent representation space; ii) a multimodal contrastive loss function that encourages the geometric alignment of the learned representations. We experimentally demonstrate that GMC representations are semantically rich and achieve state-of-the-art performance with missing modality information on three different learning problems, including prediction and reinforcement learning tasks.
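A minimal sketch of a contrastive alignment loss of this kind, assuming an NT-Xent-style objective between each modality-specific embedding and the joint embedding of the same sample (the exact GMC loss may differ):

```python
import torch
import torch.nn.functional as F

def alignment_loss(modality_z: torch.Tensor, joint_z: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Pull each modality-specific embedding towards the joint embedding of the
    same sample and away from the joint embeddings of other samples in the batch."""
    m = F.normalize(modality_z, dim=-1)   # (batch, dim)
    j = F.normalize(joint_z, dim=-1)      # (batch, dim)
    logits = m @ j.t() / tau              # pairwise similarities within the batch
    targets = torch.arange(m.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Example with random embeddings for a batch of 16 samples.
loss = alignment_loss(torch.randn(16, 64), torch.randn(16, 64))
```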
Abstract: Collective risk dilemmas (CRDs) are a class of n-player games that represent societal challenges in which groups need to coordinate to avoid the risk of a disastrous outcome. Multi-agent systems facing such dilemmas have difficulty achieving cooperation and often converge to sub-optimal, risk-dominant solutions where everyone defects. In this paper, we investigate the consequences of risk diversity in groups of agents learning to play CRDs. We find that risk diversity poses new challenges to cooperation that are not observed in homogeneous groups. We show that increasing risk diversity significantly reduces overall cooperation and hinders collective target achievement. It leads to asymmetrical changes in agents' policies -- i.e., the increase in contributions from individuals at high risk is unable to compensate for the decrease in contributions from individuals at low risk -- which reduces the total contributions in a population. When comparing RL behaviors to rational individualistic and social behaviors, we find that RL populations converge to fairer contributions among agents. Our results highlight the need to align risk perceptions among agents or to develop new learning techniques that explicitly account for risk diversity.
Abstract: In this paper, we present a novel Bayesian online prediction algorithm for the problem setting of ad hoc teamwork under partial observability (ATPO), which enables on-the-fly collaboration with unknown teammates performing an unknown task without requiring a pre-coordination protocol. Unlike previous works that assume a fully observable state of the environment, ATPO accommodates partial observability, using the agent's observations to identify which task the teammates are performing. Our approach assumes neither that the teammates' actions are visible nor that an environment reward signal is available. We evaluate ATPO in three domains -- two modified, partially observable versions of the Pursuit domain and the Overcooked domain. Our results show that ATPO is effective and robust in identifying the teammate's task from a large library of possible tasks, efficient at solving it in near-optimal time, and scalable in adapting to increasingly larger problem sizes.
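The task-identification step can be sketched as a generic Bayesian filter over a library of candidate tasks; the likelihood values below are illustrative placeholders that would come from each candidate task's model:

```python
import numpy as np

def update_task_belief(belief: np.ndarray, obs_likelihoods: np.ndarray) -> np.ndarray:
    """One step of Bayesian filtering over candidate tasks:
    belief[k] is P(task k), obs_likelihoods[k] is P(observation | task k)."""
    posterior = belief * obs_likelihoods
    total = posterior.sum()
    if total == 0.0:              # observation impossible under every task; keep prior
        return belief
    return posterior / total

# Example: uniform prior over 4 candidate tasks, one new observation.
belief = np.full(4, 0.25)
likelihoods = np.array([0.05, 0.40, 0.10, 0.01])  # placeholder likelihoods
belief = update_task_belief(belief, likelihoods)  # task 1 becomes most probable
```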