Luckeciano C. Melo

InfoQuest: Evaluating Multi-Turn Dialogue Agents for Open-Ended Conversations with Hidden Context

Feb 17, 2025

Bryan L. M. de Oliveira, Luana G. B. Martins, Bruno Brandão, Luckeciano C. Melo

Abstract:While large language models excel at following explicit instructions, they often struggle with ambiguous or incomplete user requests, defaulting to verbose, generic responses rather than seeking clarification. We introduce InfoQuest, a multi-turn chat benchmark designed to evaluate how dialogue agents handle hidden context in open-ended user requests. The benchmark presents intentionally ambiguous scenarios that require models to engage in information-seeking dialogue through clarifying questions before providing appropriate responses. Our evaluation of both open and closed-source models reveals that while proprietary models generally perform better, all current assistants struggle with effectively gathering critical information, often requiring multiple turns to infer user intent and frequently defaulting to generic responses without proper clarification. We provide a systematic methodology for generating diverse scenarios and evaluating models' information-seeking capabilities, offering insights into the current limitations of language models in handling ambiguous requests through multi-turn interactions.

Sliding Puzzles Gym: A Scalable Benchmark for State Representation in Visual Reinforcement Learning

Oct 17, 2024

Bryan L. M. de Oliveira, Murilo L. da Luz, Bruno Brandão, Luana G. B. Martins, Telma W. de L. Soares, Luckeciano C. Melo

Abstract:Learning effective visual representations is crucial in open-world environments where agents encounter diverse and unstructured observations. This ability enables agents to extract meaningful information from raw sensory inputs, like pixels, which is essential for generalization across different tasks. However, evaluating representation learning separately from policy learning remains a challenge in most reinforcement learning (RL) benchmarks. To address this, we introduce the Sliding Puzzles Gym (SPGym), a benchmark that extends the classic 15-tile puzzle with variable grid sizes and observation spaces, including large real-world image datasets. SPGym allows scaling the representation learning challenge while keeping the latent environment dynamics and algorithmic problem fixed, providing a targeted assessment of agents' ability to form compositional and generalizable state representations. Experiments with both model-free and model-based RL algorithms, with and without explicit representation learning components, show that as the representation challenge scales, SPGym effectively distinguishes agents based on their capabilities. Moreover, SPGym reaches difficulty levels where no tested algorithm consistently excels, highlighting key challenges and opportunities for advancing representation learning for decision-making research.
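
To make the fixed "latent environment dynamics" concrete, here is a minimal Python sketch of sliding-puzzle transitions on a grid of arbitrary size. It is not SPGym's API: the names `solved_grid`, `step`, and `is_solved`, the use of 0 for the blank tile, and the convention that an illegal move leaves the state unchanged are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Grid = List[List[int]]  # 0 marks the blank tile (an assumption of this sketch)

# Direction in which the blank moves, as (row delta, column delta).
MOVES: Dict[str, Tuple[int, int]] = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


def solved_grid(rows: int, cols: int) -> Grid:
    """Return the goal state: tiles 1..rows*cols-1 in order, blank last."""
    tiles = list(range(1, rows * cols)) + [0]
    return [tiles[r * cols:(r + 1) * cols] for r in range(rows)]


def find_blank(grid: Grid) -> Tuple[int, int]:
    """Locate the blank tile."""
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value == 0:
                return r, c
    raise ValueError("grid has no blank tile")


def step(grid: Grid, move: str) -> Grid:
    """Move the blank one cell; an out-of-bounds move returns an unchanged copy."""
    r, c = find_blank(grid)
    dr, dc = MOVES[move]
    nr, nc = r + dr, c + dc
    new_grid = [row[:] for row in grid]
    if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]):
        new_grid[r][c], new_grid[nr][nc] = new_grid[nr][nc], new_grid[r][c]
    return new_grid


def is_solved(grid: Grid) -> bool:
    """Check whether the grid matches the goal state for its size."""
    return grid == solved_grid(len(grid), len(grid[0]))
```

In a benchmark of this kind, the transition rule above would stay the same while only the rendering of each tile (for example, as a patch of an image) changes, which is what lets the representation challenge scale independently of the dynamics.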

Temporal-Difference Variational Continual Learning

Oct 10, 2024

Luckeciano C. Melo, Alessandro Abate, Yarin Gal

Abstract:A crucial capability of Machine Learning models in real-world applications is the ability to continuously learn new tasks. This adaptability allows them to respond to potentially inevitable shifts in the data-generating distribution over time. However, in Continual Learning (CL) settings, models often struggle to balance learning new tasks (plasticity) with retaining previous knowledge (memory stability). Consequently, they are susceptible to Catastrophic Forgetting, which degrades performance and undermines the reliability of deployed systems. Variational Continual Learning methods tackle this challenge by employing a learning objective that recursively updates the posterior distribution and enforces it to stay close to the latest posterior estimate. Nonetheless, we argue that these methods may be ineffective due to compounding approximation errors over successive recursions. To mitigate this, we propose new learning objectives that integrate the regularization effects of multiple previous posterior estimations, preventing individual errors from dominating future posterior updates and compounding over time. We reveal insightful connections between these objectives and Temporal-Difference methods, a popular learning mechanism in Reinforcement Learning and Neuroscience. We evaluate the proposed objectives on challenging versions of popular CL benchmarks, demonstrating that they outperform standard Variational CL methods and non-variational baselines, effectively alleviating Catastrophic Forgetting.

Deep Bayesian Active Learning for Preference Modeling in Large Language Models

Jun 14, 2024

Luckeciano C. Melo, Panagiotis Tigas, Alessandro Abate, Yarin Gal

Abstract:Leveraging human preferences for steering the behavior of Large Language Models (LLMs) has demonstrated notable success in recent years. Nonetheless, data selection and labeling are still a bottleneck for these systems, particularly at large scale. Hence, selecting the most informative points for acquiring human feedback may considerably reduce the cost of preference labeling and unleash the further development of LLMs. Bayesian Active Learning provides a principled framework for addressing this challenge and has demonstrated remarkable success in diverse settings. However, previous attempts to employ it for Preference Modeling did not meet such expectations. In this work, we identify that naive epistemic uncertainty estimation leads to the acquisition of redundant samples. We address this by proposing the Bayesian Active Learner for Preference Modeling (BAL-PM), a novel stochastic acquisition policy that not only targets points of high epistemic uncertainty according to the preference model but also seeks to maximize the entropy of the acquired prompt distribution in the feature space spanned by the employed LLM. Notably, our experiments demonstrate that BAL-PM requires 33% to 68% fewer preference labels in two popular human preference datasets and exceeds previous stochastic Bayesian acquisition policies.

Transformers are Meta-Reinforcement Learners

Jun 14, 2022

Luckeciano C. Melo

Abstract:The transformer architecture and its variants have achieved remarkable success across many machine learning tasks in recent years. This success is intrinsically related to the capability of handling long sequences and the presence of context-dependent weights from the attention mechanism. We argue that these capabilities suit the central role of a Meta-Reinforcement Learning algorithm. Indeed, a meta-RL agent needs to infer the task from a sequence of trajectories. Furthermore, it requires a fast adaptation strategy to adapt its policy for a new task, which can be achieved using the self-attention mechanism. In this work, we present TrMRL (Transformers for Meta-Reinforcement Learning), a meta-RL agent that mimics the memory reinstatement mechanism using the transformer architecture. It associates the recent past of working memories to build an episodic memory recursively through the transformer layers. We show that the self-attention computes a consensus representation that minimizes the Bayes Risk at each layer and provides meaningful features to compute the best actions. We conducted experiments in high-dimensional continuous control environments for locomotion and dexterous manipulation. Results show that TrMRL presents comparable or superior asymptotic performance, sample efficiency, and out-of-distribution generalization compared to the baselines in these environments.

* Published at the International Conference on Machine Learning (ICML) 2022

MARS-Gym: A Gym framework to model, train, and evaluate Recommender Systems for Marketplaces

Sep 30, 2020

Marlesson R. O. Santana, Luckeciano C. Melo, Fernando H. F. Camargo, Bruno Brandão, Anderson Soares, Renan M. Oliveira, Sandor Caetano

Abstract:Recommender Systems are especially challenging for marketplaces since they must maximize user satisfaction while maintaining the healthiness and fairness of such ecosystems. In this context, we observed a lack of resources to design, train, and evaluate agents that learn by interacting within these environments. To address this, we propose MARS-Gym, an open-source framework that enables researchers and engineers to quickly build and evaluate Reinforcement Learning agents for recommendations in marketplaces. MARS-Gym addresses the whole development pipeline: data processing, model design and optimization, and multi-sided evaluation. We also provide implementations of a diverse set of baseline agents, with a metrics-driven analysis of them on the Trivago marketplace dataset, to illustrate how to conduct a holistic assessment using the available metrics of recommendation, off-policy estimation, and fairness. With MARS-Gym, we expect to bridge the gap between academic research and production systems, as well as to facilitate the design of new algorithms and applications.

* 15 pages, 14 figures, see https://github.com/deeplearningbrasil/mars-gym

Bottom-Up Meta-Policy Search

Oct 22, 2019

Luckeciano C. Melo, Marcos R. O. A. Maximo, Adilson Marques da Cunha

Abstract:Despite recent progress in agents that learn through interaction, several challenges remain in terms of sample efficiency and generalization to behaviors unseen during training. To mitigate these problems, we propose and apply a first-order Meta-Learning algorithm called Bottom-Up Meta-Policy Search (BUMPS), which uses a two-phase optimization procedure: first, in a meta-training phase, it distills a few expert policies to create a meta-policy capable of generalizing knowledge to tasks unseen during training; second, it applies a fast adaptation strategy named Policy Filtering, which evaluates a few policies sampled from the meta-policy distribution and selects the one that best solves the task. We conducted all experiments in the RoboCup 3D Soccer Simulation domain, in the context of kick motion learning. We show that, given our experimental setup, BUMPS works in scenarios where simple multi-task Reinforcement Learning does not. Finally, we performed experiments to evaluate each component of the algorithm.
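
The adaptation step described above (sample a few policies, evaluate each, keep the best) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the meta-policy is assumed to be a diagonal Gaussian over a flat parameter vector, and `policy_filtering`, `evaluate`, `num_candidates`, and `seed` are hypothetical names introduced here.

```python
import random
from typing import Callable, List, Sequence, Tuple


def policy_filtering(
    mean: Sequence[float],
    std: Sequence[float],
    evaluate: Callable[[List[float]], float],
    num_candidates: int = 5,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Sample candidate policies, score each one, and return the best.

    Assumption: the meta-policy distribution is a diagonal Gaussian over
    policy parameters, and `evaluate` returns the task return (higher is better).
    """
    rng = random.Random(seed)
    # Draw a small number of candidate parameter vectors from the meta-policy.
    candidates = [
        [rng.gauss(m, s) for m, s in zip(mean, std)]
        for _ in range(num_candidates)
    ]
    # Evaluate every candidate on the target task (e.g. one rollout each).
    returns = [evaluate(params) for params in candidates]
    # Keep the candidate with the highest return.
    best = max(range(num_candidates), key=returns.__getitem__)
    return candidates[best], returns[best]
```

A caller would supply `evaluate` as a function that runs the candidate policy on the new task and reports its return; because only a handful of candidates are scored and no gradient step is taken, the cost of adaptation is `num_candidates` evaluations.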

Learning Humanoid Robot Running Skills through Proximal Policy Optimization

Oct 22, 2019

Luckeciano C. Melo, Marcos R. O. A. Maximo

Abstract:At the current level of evolution of Soccer 3D, motion control is a key factor in a team's performance. Recent works take advantage of model-free approaches based on Machine Learning to exploit robot dynamics in order to obtain faster locomotion skills, achieving running policies and, therefore, opening a new research direction in the Soccer 3D environment. In this work, we present a methodology based on Deep Reinforcement Learning that learns running skills without any prior knowledge, using a neural network whose inputs are related to the robot's dynamics. Our results outperformed the previous state-of-the-art sprint velocity reported in the Soccer 3D literature by a significant margin. The method also demonstrated improved sample efficiency, learning how to run in just a few hours. We report our results by analyzing the training procedure and evaluating the policies in terms of speed, reliability, and human similarity. Finally, we present the key factors that led us to improve on previous results and share some ideas for future work.
