Timothy P. Lillicrap

Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning

May 01, 2024

Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P. Lillicrap, Kenji Kawaguchi, Michael Shieh

Abstract: We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process inspired by the successful strategy employed by AlphaZero. Our work leverages Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals. To enhance consistency in intermediate steps, we combine outcome validation and stepwise self-evaluation, continually updating the quality assessment of newly generated data. The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data. Theoretical analysis reveals the critical importance of using on-policy sampled data for successful self-improving. Extensive evaluations on various arithmetic and commonsense reasoning tasks demonstrate remarkable performance improvements over existing models. For instance, our approach outperforms the Mistral-7B Supervised Fine-Tuning (SFT) baseline on GSM8K, MATH, and SciQ, with substantial percentage increases in accuracy to $80.7\%$ (+$4.8\%$), $32.2\%$ (+$3.3\%$), and $88.5\%$ (+$7.7\%$), respectively. Additionally, our research delves into the training and inference compute tradeoff, providing insights into how our method effectively maximizes performance gains.
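
As a rough illustration of the DPO update mentioned in the abstract, the sketch below computes the standard DPO loss for a single step-level preference pair. The function names, the `beta` default, and the rule of pairing the best- and worst-valued candidate steps are assumptions made for this example; the abstract does not confirm them.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def select_pair(candidates):
    """Pick (preferred, dispreferred) steps from (step_text, value) tuples.

    Assumption: `value` is a search-derived quality estimate (for example an
    MCTS Q-value); this sketch simply takes the best and worst candidates.
    """
    ranked = sorted(candidates, key=lambda c: c[1])
    return ranked[-1][0], ranked[0][0]


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair of log-probabilities.

    logp_w / logp_l: policy log-probs of the preferred / dispreferred step.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return softplus(-margin)  # equals -log(sigmoid(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy assigns relatively more probability to the preferred step.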

Towards Biologically Plausible Convolutional Networks

Jun 22, 2021

Roman Pogodin, Yash Mehta, Timothy P. Lillicrap, Peter E. Latham

Abstract: Convolutional networks are ubiquitous in deep learning. They are particularly useful for images, as they reduce the number of parameters, reduce training time, and increase accuracy. However, as a model of the brain they are seriously problematic, since they require weight sharing - something real neurons simply cannot do. Consequently, while neurons in the brain can be locally connected (one of the features of convolutional networks), they cannot be convolutional. Locally connected but non-convolutional networks, however, significantly underperform convolutional ones. This is troublesome for studies that use convolutional networks to explain activity in the visual system. Here we study plausible alternatives to weight sharing that aim at the same regularization principle, which is to make each neuron within a pool react similarly to identical inputs. The most natural way to do that is by showing the network multiple translations of the same image, akin to saccades in animal vision. However, this approach requires many translations, and doesn't remove the performance gap. We propose instead to add lateral connectivity to a locally connected network, and allow learning via Hebbian plasticity. This requires the network to pause occasionally for a sleep-like phase of "weight sharing". This method enables locally connected networks to achieve nearly convolutional performance on ImageNet, thus supporting convolutional networks as a model of the visual stream.

Training Generative Adversarial Networks by Solving Ordinary Differential Equations

Oct 28, 2020

Chongli Qin, Yan Wu, Jost Tobias Springenberg, Andrew Brock, Jeff Donahue, Timothy P. Lillicrap, Pushmeet Kohli

Abstract: The instability of Generative Adversarial Network (GAN) training has frequently been attributed to gradient descent. Consequently, recent methods have aimed to tailor the models and training procedures to stabilise the discrete updates. In contrast, we study the continuous-time dynamics induced by GAN training. Both theory and toy experiments suggest that these dynamics are in fact surprisingly stable. From this perspective, we hypothesise that instabilities in training GANs arise from the integration error in discretising the continuous dynamics. We experimentally verify that well-known ODE solvers (such as Runge-Kutta) can stabilise training - when combined with a regulariser that controls the integration error. Our approach represents a radical departure from previous methods which typically use adaptive optimisation and stabilisation techniques that constrain the functional space (e.g. Spectral Normalisation). Evaluation on CIFAR-10 and ImageNet shows that our method outperforms several strong baselines, demonstrating its efficacy.
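
The integration-error hypothesis can be illustrated on a toy problem. The sketch below is not the paper's experiment and omits its regulariser; it assumes a two-parameter bilinear game whose continuous dynamics are dx/dt = -y, dy/dt = x, so the exact trajectory stays on a circle. Explicit Euler (plain simultaneous gradient steps) spirals outward, while a classical fourth-order Runge-Kutta step stays close to the circle. All names here are hypothetical.

```python
import math


def field(state):
    """Vector field of the toy bilinear game: dx/dt = -y, dy/dt = x."""
    x, y = state
    return (-y, x)


def axpy(state, scale, direction):
    """Return state + scale * direction, component-wise."""
    return (state[0] + scale * direction[0], state[1] + scale * direction[1])


def euler_step(state, h):
    """One explicit Euler step of size h."""
    return axpy(state, h, field(state))


def rk4_step(state, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    k1 = field(state)
    k2 = field(axpy(state, 0.5 * h, k1))
    k3 = field(axpy(state, 0.5 * h, k2))
    k4 = field(axpy(state, h, k3))
    combined = tuple(a + 2 * b + 2 * c + d for a, b, c, d in zip(k1, k2, k3, k4))
    return axpy(state, h / 6.0, combined)


def final_radius(step_fn, h=0.1, n_steps=1000, start=(1.0, 0.0)):
    """Distance from the equilibrium (0, 0) after n_steps of step_fn."""
    state = start
    for _ in range(n_steps):
        state = step_fn(state, h)
    return math.hypot(state[0], state[1])
```

Comparing `final_radius(euler_step)` with `final_radius(rk4_step)` shows the Euler iterate moving far from the equilibrium while the Runge-Kutta iterate remains near its starting radius of 1.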

Compressive Transformers for Long-Range Sequence Modelling

Nov 13, 2019

Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Timothy P. Lillicrap

Abstract: We present the Compressive Transformer, an attentive sequence model which compresses past memories for long-range sequence learning. We find the Compressive Transformer obtains state-of-the-art language modelling results in the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc respectively. We also find it can model high-frequency speech effectively and can be used as a memory mechanism for RL, demonstrated on an object matching task. To promote the domain of long-range sequence learning, we propose a new open-vocabulary language modelling benchmark derived from books, PG-19.

* 19 pages, 6 figures, 10 tables
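
To make "compresses past memories" concrete, here is a minimal sketch of one possible scheme, assuming a fixed-size FIFO memory whose evicted entries are mean-pooled in groups of `rate` into a second, smaller buffer. The class name, parameters, and the choice of mean pooling are assumptions for illustration; the abstract does not specify the compression function.

```python
from collections import deque


class CompressiveMemory:
    """Toy two-tier memory: recent vectors plus mean-pooled older vectors."""

    def __init__(self, mem_size, comp_size, rate):
        self.mem_size = mem_size             # capacity of the uncompressed memory
        self.rate = rate                     # how many evicted vectors form one slot
        self.mem = deque()                   # most recent vectors, oldest first
        self.comp = deque(maxlen=comp_size)  # compressed slots; oldest dropped when full
        self._pending = []                   # evicted vectors awaiting compression

    def append(self, vec):
        """Store a new vector, compressing whatever falls out of `mem`."""
        self.mem.append(list(vec))
        while len(self.mem) > self.mem_size:
            self._pending.append(self.mem.popleft())
            if len(self._pending) == self.rate:
                # Mean-pool the group element-wise into a single vector.
                pooled = [sum(col) / self.rate for col in zip(*self._pending)]
                self.comp.append(pooled)
                self._pending = []


memory = CompressiveMemory(mem_size=2, comp_size=2, rate=2)
for value in [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]:
    memory.append([value])
```

After the loop, `memory.mem` holds the two newest vectors and `memory.comp` holds the pooled summaries of the four older ones.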

Automated curricula through setter-solver interactions

Sep 27, 2019

Sebastien Racaniere, Andrew K. Lampinen, Adam Santoro, David P. Reichert, Vlad Firoiu, Timothy P. Lillicrap

Abstract: Reinforcement learning algorithms use correlations between policies and rewards to improve agent performance. But in dynamic or sparsely rewarding environments these correlations are often too small, or rewarding events are too infrequent to make learning feasible. Human education instead relies on curricula--the breakdown of tasks into simpler, static challenges with dense rewards--to build up to complex behaviors. While curricula are also useful for artificial agents, hand-crafting them is time-consuming. This has led researchers to explore automatic curriculum generation. Here we explore automatic curriculum generation in rich, dynamic environments. Using a setter-solver paradigm we show the importance of considering goal validity, goal feasibility, and goal coverage to construct useful curricula. We demonstrate the success of our approach in rich but sparsely rewarding 2D and 3D environments, where an agent is tasked to achieve a single goal selected from a set of possible goals that varies between episodes, and identify challenges for future work. Finally, we demonstrate the value of a novel technique that guides agents towards a desired goal distribution. Altogether, these results represent a substantial step towards applying automatic task curricula to learn complex, otherwise unlearnable goals, and to our knowledge are the first to demonstrate automated curriculum generation for goal-conditioned agents in environments where the possible goals vary between episodes.

What does it mean to understand a neural network?

Jul 15, 2019

Timothy P. Lillicrap, Konrad P. Kording

Abstract: We can define a neural network that can learn to recognize objects in less than 100 lines of code. However, after training, it is characterized by millions of weights that contain the knowledge about many object types across visual scenes. Such networks are thus dramatically easier to understand in terms of the code that makes them than the resulting properties, such as tuning or connections. In analogy, we conjecture that rules for development and learning in brains may be far easier to understand than their resulting properties. The analogy suggests that neuroscience would benefit from a focus on learning and development.

* 9 pages, 2 figures

Experience Replay for Continual Learning

Nov 28, 2018

David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy P. Lillicrap, Greg Wayne

Abstract: Continual learning is the problem of learning new tasks or knowledge while protecting old knowledge and ideally generalizing from old experience to learn new tasks faster. Neural networks trained by stochastic gradient descent often degrade on old tasks when trained successively on new tasks with different data distributions. This phenomenon, referred to as catastrophic forgetting, is considered a major hurdle to learning with non-stationary data or sequences of new tasks, and prevents networks from continually accumulating knowledge and skills. We examine this issue in the context of reinforcement learning, in a setting where an agent is exposed to tasks in a sequence. Unlike most other work, we do not provide an explicit indication to the model of task boundaries, which is the most general circumstance for a learning agent exposed to continuous experience. While various methods to counteract catastrophic forgetting have recently been proposed, we explore a straightforward, general, and seemingly overlooked solution - that of using experience replay buffers for all past events - with a mixture of on- and off-policy learning, leveraging behavioral cloning. We show that this strategy can still learn new tasks quickly yet can substantially reduce catastrophic forgetting in both Atari and DMLab domains, even matching the performance of methods that require task identities. When buffer storage is constrained, we confirm that a simple mechanism for randomly discarding data allows a limited size buffer to perform almost as well as an unbounded one.
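
One simple way to randomly discard data from a bounded buffer is reservoir sampling, sketched below. This is an illustrative choice under the assumption that every past experience should be equally likely to remain in the buffer; the abstract does not name the mechanism, and the class and method names are hypothetical.

```python
import random


class ReservoirBuffer:
    """Bounded replay buffer keeping a uniform random sample of all items seen."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []
        self.seen = 0                    # total number of items ever offered
        self.rng = random.Random(seed)   # private generator for reproducibility

    def add(self, item):
        """Offer one experience; it may replace a stored one or be dropped."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
            return
        # Keep the new item with probability capacity / seen.
        slot = self.rng.randrange(self.seen)
        if slot < self.capacity:
            self.items[slot] = item

    def sample(self, batch_size):
        """Draw a replay batch without replacement from the stored items."""
        return self.rng.sample(self.items, min(batch_size, len(self.items)))


buffer = ReservoirBuffer(capacity=10, seed=0)
for step in range(1000):
    buffer.add(step)
```

The buffer never grows past `capacity`, yet early and late experiences have the same chance of being present when a batch is drawn.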

Learning to Learn without Gradient Descent by Gradient Descent

Jun 12, 2017

Yutian Chen, Matthew W. Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Timothy P. Lillicrap, Matt Botvinick, Nando de Freitas

Abstract: We learn recurrent neural network optimizers trained on simple synthetic functions by gradient descent. We show that these learned optimizers exhibit a remarkable degree of transfer in that they can be used to efficiently optimize a broad range of derivative-free black-box functions, including Gaussian process bandits, simple control objectives, global optimization benchmarks and hyper-parameter tuning tasks. Up to the training horizon, the learned optimizers learn to trade off exploration and exploitation, and compare favourably with heavily engineered Bayesian optimization packages for hyper-parameter tuning.

* Accepted by ICML 2017. Previous version "Learning to Learn for Global Optimization of Black Box Functions" was published in the Deep Reinforcement Learning Workshop, NIPS 2016

Asynchronous Methods for Deep Reinforcement Learning

Jun 16, 2016

Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu

Abstract: We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training, allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.

* ICML 2016

Continuous control with deep reinforcement learning

Feb 29, 2016

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra

Abstract: We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.

* 10 pages + supplementary
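
Two ingredients commonly associated with this kind of actor-critic method can be sketched in a few lines: a bootstrapped critic target and slowly tracking target-network parameters. The abstract does not spell out either detail, so the function names, the flat-list parameter representation, and the default values of `gamma` and `tau` below are assumptions for illustration only.

```python
def td_target(reward, next_q, done, gamma=0.99):
    """Bootstrapped critic target: r + gamma * Q'(s', mu'(s')).

    `next_q` stands for the target critic evaluated at the next state and the
    target actor's action there; terminal transitions do not bootstrap.
    """
    return reward if done else reward + gamma * next_q


def soft_update(target_params, online_params, tau=0.001):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


# Usage with made-up numbers: a target network trailing the online network.
target = [0.0, 2.0]
online = [1.0, 2.0]
target = soft_update(target, online, tau=0.1)
```

A small `tau` keeps the targets used by the critic changing slowly, which is one way to borrow the stabilising role that target networks play in Deep Q-Learning.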
