Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Seyed Mohammad Asghari

Efficient Exploration at Scale

Mar 18, 2026

Seyed Mohammad Asghari, Chris Chute, Vikranth Dwaracherla, Xiuyuan Lu, Mehdi Jafarnia, Victor Minden, Zheng Wen, Benjamin Van Roy

Abstract:We develop an online learning algorithm that dramatically improves the data efficiency of reinforcement learning from human feedback (RLHF). Our algorithm incrementally updates reward and language models as choice data is received. The reward model is fit to the choice data, while the language model is updated by a variation of reinforce, with reinforcement signals provided by the reward model. Several features enable the efficiency gains: a small affirmative nudge added to each reinforcement signal, an epistemic neural network that models reward uncertainty, and information-directed exploration. With Gemma large language models (LLMs), our algorithm matches the performance of offline RLHF trained on 200K labels using fewer than 20K labels, representing more than a 10x gain in data efficiency. Extrapolating from our results, we expect our algorithm trained on 1M labels to match offline RLHF trained on 1B labels. This represents a 1,000x gain. To our knowledge, these are the first results to demonstrate that such large improvements are possible.

Via

Access Paper or Ask Questions

Efficient Exploration for LLMs

Feb 01, 2024

Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, Benjamin Van Roy

Figure 1 for Efficient Exploration for LLMs

Figure 2 for Efficient Exploration for LLMs

Figure 3 for Efficient Exploration for LLMs

Figure 4 for Efficient Exploration for LLMs

Abstract:We present evidence of substantial benefit from efficient exploration in gathering human feedback to improve large language models. In our experiments, an agent sequentially generates queries while fitting a reward model to the feedback received. Our best-performing agent generates queries using double Thompson sampling, with uncertainty represented by an epistemic neural network. Our results demonstrate that efficient exploration enables high levels of performance with far fewer queries. Further, both uncertainty estimation and the choice of exploration scheme play critical roles.

Via

Access Paper or Ask Questions

Approximate Thompson Sampling via Epistemic Neural Networks

Feb 18, 2023

Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Morteza Ibrahimi, Xiuyuan Lu, Benjamin Van Roy

Figure 1 for Approximate Thompson Sampling via Epistemic Neural Networks

Figure 2 for Approximate Thompson Sampling via Epistemic Neural Networks

Figure 3 for Approximate Thompson Sampling via Epistemic Neural Networks

Figure 4 for Approximate Thompson Sampling via Epistemic Neural Networks

Abstract:Thompson sampling (TS) is a popular heuristic for action selection, but it requires sampling from a posterior distribution. Unfortunately, this can become computationally intractable in complex environments, such as those modeled using neural networks. Approximate posterior samples can produce effective actions, but only if they reasonably approximate joint predictive distributions of outputs across inputs. Notably, accuracy of marginal predictive distributions does not suffice. Epistemic neural networks (ENNs) are designed to produce accurate joint predictive distributions. We compare a range of ENNs through computational experiments that assess their performance in approximating TS across bandit and reinforcement learning environments. The results indicate that ENNs serve this purpose well and illustrate how the quality of joint predictive distributions drives performance. Further, we demonstrate that the \textit{epinet} -- a small additive network that estimates uncertainty -- matches the performance of large ensembles at orders of magnitude lower computational cost. This enables effective application of TS with computation that scales gracefully to complex environments.

Via

Access Paper or Ask Questions

Fine-Tuning Language Models via Epistemic Neural Networks

Nov 03, 2022

Ian Osband, Seyed Mohammad Asghari, Benjamin Van Roy, Nat McAleese, John Aslanides, Geoffrey Irving

Figure 1 for Fine-Tuning Language Models via Epistemic Neural Networks

Figure 2 for Fine-Tuning Language Models via Epistemic Neural Networks

Figure 3 for Fine-Tuning Language Models via Epistemic Neural Networks

Figure 4 for Fine-Tuning Language Models via Epistemic Neural Networks

Abstract:Large language models are now part of a powerful new paradigm in machine learning. These models learn a wide range of capabilities from training on large unsupervised text corpora. In many applications, these capabilities are then fine-tuned through additional training on specialized data to improve performance in that setting. In this paper, we augment these models with an epinet: a small additional network architecture that helps to estimate model uncertainty and form an epistemic neural network (ENN). ENNs are neural networks that can know what they don't know. We show that, using an epinet to prioritize uncertain data, we can fine-tune BERT on GLUE tasks to the same performance while using 2x less data. We also investigate performance in synthetic neural network generative models designed to build understanding. In each setting, using an epinet outperforms heuristic active learning schemes.

Via

Access Paper or Ask Questions

Robustness of Epinets against Distributional Shifts

Jul 01, 2022

Xiuyuan Lu, Ian Osband, Seyed Mohammad Asghari, Sven Gowal, Vikranth Dwaracherla, Zheng Wen, Benjamin Van Roy

Figure 1 for Robustness of Epinets against Distributional Shifts

Figure 2 for Robustness of Epinets against Distributional Shifts

Figure 3 for Robustness of Epinets against Distributional Shifts

Figure 4 for Robustness of Epinets against Distributional Shifts

Abstract:Recent work introduced the epinet as a new approach to uncertainty modeling in deep learning. An epinet is a small neural network added to traditional neural networks, which, together, can produce predictive distributions. In particular, using an epinet can greatly improve the quality of joint predictions across multiple inputs, a measure of how well a neural network knows what it does not know. In this paper, we examine whether epinets can offer similar advantages under distributional shifts. We find that, across ImageNet-A/O/C, epinets generally improve robustness metrics. Moreover, these improvements are more significant than those afforded by even very large ensembles at orders of magnitude lower computational costs. However, these improvements are relatively small compared to the outstanding issues in distributionally-robust deep learning. Epinets may be a useful tool in the toolbox, but they are far from the complete solution.

Via

Access Paper or Ask Questions

Ensembles for Uncertainty Estimation: Benefits of Prior Functions and Bootstrapping

Jun 08, 2022

Vikranth Dwaracherla, Zheng Wen, Ian Osband, Xiuyuan Lu, Seyed Mohammad Asghari, Benjamin Van Roy

Figure 1 for Ensembles for Uncertainty Estimation: Benefits of Prior Functions and Bootstrapping

Figure 2 for Ensembles for Uncertainty Estimation: Benefits of Prior Functions and Bootstrapping

Figure 3 for Ensembles for Uncertainty Estimation: Benefits of Prior Functions and Bootstrapping

Figure 4 for Ensembles for Uncertainty Estimation: Benefits of Prior Functions and Bootstrapping

Abstract:In machine learning, an agent needs to estimate uncertainty to efficiently explore and adapt and to make effective decisions. A common approach to uncertainty estimation maintains an ensemble of models. In recent years, several approaches have been proposed for training ensembles, and conflicting views prevail with regards to the importance of various ingredients of these approaches. In this paper, we aim to address the benefits of two ingredients -- prior functions and bootstrapping -- which have come into question. We show that prior functions can significantly improve an ensemble agent's joint predictions across inputs and that bootstrapping affords additional benefits if the signal-to-noise ratio varies across inputs. Our claims are justified by both theoretical and experimental results.

Via

Access Paper or Ask Questions

Evaluating High-Order Predictive Distributions in Deep Learning

Feb 28, 2022

Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Xiuyuan Lu, Benjamin Van Roy

Figure 1 for Evaluating High-Order Predictive Distributions in Deep Learning

Figure 2 for Evaluating High-Order Predictive Distributions in Deep Learning

Figure 3 for Evaluating High-Order Predictive Distributions in Deep Learning

Figure 4 for Evaluating High-Order Predictive Distributions in Deep Learning

Abstract:Most work on supervised learning research has focused on marginal predictions. In decision problems, joint predictive distributions are essential for good performance. Previous work has developed methods for assessing low-order predictive distributions with inputs sampled i.i.d. from the testing distribution. With low-dimensional inputs, these methods distinguish agents that effectively estimate uncertainty from those that do not. We establish that the predictive distribution order required for such differentiation increases greatly with input dimension, rendering these methods impractical. To accommodate high-dimensional inputs, we introduce \textit{dyadic sampling}, which focuses on predictive distributions associated with random \textit{pairs} of inputs. We demonstrate that this approach efficiently distinguishes agents in high-dimensional examples involving simple logistic regression as well as complex synthetic and empirical data.

Via

Access Paper or Ask Questions

Evaluating Predictive Distributions: Does Bayesian Deep Learning Work?

Oct 09, 2021

Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Botao Hao, Morteza Ibrahimi, Dieterich Lawson, Xiuyuan Lu, Brendan O'Donoghue, Benjamin Van Roy

Figure 1 for Evaluating Predictive Distributions: Does Bayesian Deep Learning Work?

Figure 2 for Evaluating Predictive Distributions: Does Bayesian Deep Learning Work?

Figure 3 for Evaluating Predictive Distributions: Does Bayesian Deep Learning Work?

Figure 4 for Evaluating Predictive Distributions: Does Bayesian Deep Learning Work?

Abstract:Posterior predictive distributions quantify uncertainties ignored by point estimates. This paper introduces \textit{The Neural Testbed}, which provides tools for the systematic evaluation of agents that generate such predictions. Crucially, these tools assess not only the quality of marginal predictions per input, but also joint predictions given many inputs. Joint distributions are often critical for useful uncertainty quantification, but they have been largely overlooked by the Bayesian deep learning community. We benchmark several approaches to uncertainty estimation using a neural-network-based data generating process. Our results reveal the importance of evaluation beyond marginal predictions. Further, they reconcile sources of confusion in the field, such as why Bayesian deep learning approaches that generate accurate marginal predictions perform poorly in sequential decision tasks, how incorporating priors can be helpful, and what roles epistemic versus aleatoric uncertainty play when evaluating performance. We also present experiments on real-world challenge datasets, which show a high correlation with testbed results, and that the importance of evaluating joint predictive distributions carries over to real data. As part of this effort, we opensource The Neural Testbed, including all implementations from this paper.

Via

Access Paper or Ask Questions

Regret Bounds for Decentralized Learning in Cooperative Multi-Agent Dynamical Systems

Jan 27, 2020

Seyed Mohammad Asghari, Yi Ouyang, Ashutosh Nayyar

Figure 1 for Regret Bounds for Decentralized Learning in Cooperative Multi-Agent Dynamical Systems

Figure 2 for Regret Bounds for Decentralized Learning in Cooperative Multi-Agent Dynamical Systems

Figure 3 for Regret Bounds for Decentralized Learning in Cooperative Multi-Agent Dynamical Systems

Figure 4 for Regret Bounds for Decentralized Learning in Cooperative Multi-Agent Dynamical Systems

Abstract:Regret analysis is challenging in Multi-Agent Reinforcement Learning (MARL) primarily due to the dynamical environments and the decentralized information among agents. We attempt to solve this challenge in the context of decentralized learning in multi-agent linear-quadratic (LQ) dynamical systems. We begin with a simple setup consisting of two agents and two dynamically decoupled stochastic linear systems, each system controlled by an agent. The systems are coupled through a quadratic cost function. When both systems' dynamics are unknown and there is no communication among the agents, we show that no learning policy can generate sub-linear in $T$ regret, where $T$ is the time horizon. When only one system's dynamics are unknown and there is one-directional communication from the agent controlling the unknown system to the other agent, we propose a MARL algorithm based on the construction of an auxiliary single-agent LQ problem. The auxiliary single-agent problem in the proposed MARL algorithm serves as an implicit coordination mechanism among the two learning agents. This allows the agents to achieve a regret within $O(\sqrt{T})$ of the regret of the auxiliary single-agent problem. Consequently, using existing results for single-agent LQ regret, our algorithm provides a $\tilde{O}(\sqrt{T})$ regret bound. (Here $\tilde{O}(\cdot)$ hides constants and logarithmic factors). Our numerical experiments indicate that this bound is matched in practice. From the two-agent problem, we extend our results to multi-agent LQ systems with certain communication patterns.

Via

Access Paper or Ask Questions

Learning to Code: Coded Caching via Deep Reinforcement Learning

Dec 09, 2019

Navid Naderializadeh, Seyed Mohammad Asghari

Figure 1 for Learning to Code: Coded Caching via Deep Reinforcement Learning

Figure 2 for Learning to Code: Coded Caching via Deep Reinforcement Learning

Figure 3 for Learning to Code: Coded Caching via Deep Reinforcement Learning

Figure 4 for Learning to Code: Coded Caching via Deep Reinforcement Learning

Abstract:We consider a system comprising a file library and a network with a server and multiple users equipped with cache memories. The system operates in two phases: a prefetching phase, where users load their caches with parts of contents from the library, and a delivery phase, where users request files from the library and the server needs to send the uncached parts of the requested files to the users. For the case where the users' caches are arbitrarily loaded, we propose an algorithm based on deep reinforcement learning to minimize the delay of delivering requested contents to the users in the delivery phase. Simulation results demonstrate that our proposed deep reinforcement learning agent learns a coded delivery strategy for sending the requests to the users, which slightly outperforms the state-of-the-art performance in terms of delivery delay, while drastically reducing the computational complexity.

* Presented at the 2019 Asilomar Conference on Signals, Systems, and Computers

Via

Access Paper or Ask Questions