Danica J. Sutherland

Efficient kernelized bandit algorithms via exploration distributions

Jun 11, 2025

Bingshan Hu, Zheng He, Danica J. Sutherland

Abstract: We consider a kernelized bandit problem with a compact arm set $X \subset \mathbb{R}^d$ and a fixed but unknown reward function $f^*$ with a finite norm in some Reproducing Kernel Hilbert Space (RKHS). We propose a class of computationally efficient kernelized bandit algorithms, which we call GP-Generic, based on a novel concept: exploration distributions. This class of algorithms includes Upper Confidence Bound-based approaches as a special case, but also allows for a variety of randomized algorithms. With careful choice of exploration distribution, our proposed generic algorithm realizes a wide range of concrete algorithms that achieve $\tilde{O}(\gamma_T\sqrt{T})$ regret bounds, where $\gamma_T$ characterizes the RKHS complexity. This matches known results for UCB- and Thompson Sampling-based algorithms; we also show that randomization can yield better results in practice.

On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization

May 24, 2025

Wenlong Deng, Yi Ren, Muchen Li, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis

Abstract: Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs), with Group Relative Policy Optimization (GRPO) emerging as a widely used algorithm in recent systems. Despite GRPO's widespread adoption, we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. This behavior mirrors a recently discovered misalignment issue in Direct Preference Optimization (DPO), attributed to the influence of negative gradients. We provide a theoretical analysis of GRPO's learning dynamic, identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength. To address this, we develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens. Experiments on math reasoning benchmarks demonstrate that NTHR effectively mitigates LLD, yielding consistent performance gains across models ranging from 0.5B to 3B parameters.
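As a rough illustration of the "same strength" penalization the abstract refers to, the sketch below computes group-relative advantages the way GRPO is commonly described: rewards for a group of sampled responses are standardized within the group, and each response's advantage is then shared by all of its tokens. The function names, the use of the population standard deviation, and the `eps` constant are assumptions made for this example. This is not the paper's code, and it does not implement NTHR.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    Assumption: population standard deviation plus a small eps for stability;
    real implementations may normalize differently.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def token_advantages(advantage, num_tokens):
    """Broadcast one response-level advantage to every token of that response."""
    return [advantage] * num_tokens


# Two correct (reward 1) and two incorrect (reward 0) responses in one group.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Every token of an incorrect response receives the same negative weight.
penalties = token_advantages(advantages[1], num_tokens=5)
```

In this toy group the incorrect responses get an advantage close to -1, and `penalties` repeats that value for all five tokens, which is the uniform token-level penalty that the abstract identifies as the source of LLD.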

Uncertainty Herding: One Active Learning Method for All Label Budgets

Dec 30, 2024

Wonho Bae, Gabriel L. Oliveira, Danica J. Sutherland

Abstract: Most active learning research has focused on methods which perform well when many labels are available, but can be dramatically worse than random selection when label budgets are small. Other methods have focused on the low-budget regime, but do poorly as label budgets increase. As the line between "low" and "high" budgets varies by problem, this is a serious issue in practice. We propose uncertainty coverage, an objective which generalizes a variety of low- and high-budget objectives, as well as natural, hyperparameter-light methods to smoothly interpolate between low- and high-budget regimes. We call greedy optimization of the estimate Uncertainty Herding; this simple method is computationally fast, and we prove that it nearly optimizes the distribution-level coverage. In experimental validation across a variety of active learning tasks, our proposal matches or beats state-of-the-art performance in essentially all cases; it is the only method of which we are aware that reliably works well in both low- and high-budget settings.

Understanding Simplicity Bias towards Compositional Mappings via Learning Dynamics

Sep 15, 2024

Yi Ren, Danica J. Sutherland

Abstract: Obtaining compositional mappings is important for the model to generalize well compositionally. To better understand when and how to encourage the model to learn such mappings, we study their uniqueness through different perspectives. Specifically, we first show that the compositional mappings are the simplest bijections through the lens of coding length (i.e., an upper bound of their Kolmogorov complexity). This property explains why models having such mappings can generalize well. We further show that the simplicity bias is usually an intrinsic property of neural network training via gradient descent. That partially explains why some models spontaneously generalize well when they are trained appropriately.

* 4 pages

Learning Deep Kernels for Non-Parametric Independence Testing

Sep 10, 2024

Nathaniel Xu, Feng Liu, Danica J. Sutherland

Abstract: The Hilbert-Schmidt Independence Criterion (HSIC) is a powerful tool for nonparametric detection of dependence between random variables. It crucially depends, however, on the selection of reasonable kernels; commonly-used choices like the Gaussian kernel, or the kernel that yields the distance covariance, are sufficient only for amply sized samples from data distributions with relatively simple forms of dependence. We propose a scheme for selecting the kernels used in an HSIC-based independence test, based on maximizing an estimate of the asymptotic test power. We prove that maximizing this estimate indeed approximately maximizes the true power of the test, and demonstrate that our learned kernels can identify forms of structured dependence between random variables in various experiments.
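For readers unfamiliar with HSIC, the following is a minimal sketch of the standard biased estimator, $\frac{1}{n^2}\operatorname{tr}(KHLH)$ with $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$, using Gaussian kernels with fixed bandwidths on scalar samples. It uses only the Python standard library to stay self-contained. The function names and default bandwidths are assumptions for this example, and some references normalize by $(n-1)^2$ instead of $n^2$. It does not implement the kernel-learning scheme described in the abstract.

```python
import math
import random


def gaussian_gram(xs, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel on scalar samples."""
    return [[math.exp(-(a - b) ** 2 / (2.0 * bandwidth ** 2)) for b in xs]
            for a in xs]


def center(gram):
    """Return H @ gram @ H for a symmetric Gram matrix, with H = I - 11^T / n."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    grand_mean = sum(row_means) / n
    # Symmetry means column means equal row means.
    return [[gram[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


def hsic_biased(xs, ys, bandwidth_x=1.0, bandwidth_y=1.0):
    """Biased HSIC estimate: trace(K H L H) / n^2."""
    n = len(xs)
    k_centered = center(gaussian_gram(xs, bandwidth_x))
    l_gram = gaussian_gram(ys, bandwidth_y)
    # trace(K H L H) = trace((H K H) L) = sum of elementwise products.
    total = sum(k_centered[i][j] * l_gram[i][j]
                for i in range(n) for j in range(n))
    return total / n ** 2


random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(100)]
ys_dependent = [x + 0.1 * random.gauss(0.0, 1.0) for x in xs]
ys_independent = [random.gauss(0.0, 1.0) for _ in range(100)]

hsic_dep = hsic_biased(xs, ys_dependent)
hsic_ind = hsic_biased(xs, ys_independent)
```

On such data the estimate for the dependent pair should be noticeably larger than for the independent pair. The fixed bandwidths here are exactly the kind of kernel choice that the abstract argues should instead be selected to maximize estimated test power.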

Why Do You Grok? A Theoretical Analysis of Grokking Modular Addition

Jul 17, 2024

Mohamad Amin Mohamadi, Zhiyuan Li, Lei Wu, Danica J. Sutherland

Abstract: We present a theoretical explanation of the "grokking" phenomenon, where a model generalizes long after overfitting, for the originally-studied problem of modular addition. First, we show that early in gradient descent, when the "kernel regime" approximately holds, no permutation-equivariant model can achieve small population error on modular addition unless it sees at least a constant fraction of all possible data points. Eventually, however, models escape the kernel regime. We show that two-layer quadratic networks that achieve zero training loss with bounded $\ell_{\infty}$ norm generalize well with substantially fewer training points, and further show such networks exist and can be found by gradient descent with small $\ell_{\infty}$ regularization. We further provide empirical evidence that these networks, as well as simple Transformers, leave the kernel regime only after initially overfitting. Taken together, our results strongly support the case for grokking as a consequence of the transition from kernel-like behavior to limiting behavior of gradient descent on deep networks.

* Accepted by ICML 2024

Generalized Coverage for More Robust Low-Budget Active Learning

Jul 16, 2024

Wonho Bae, Junhyug Noh, Danica J. Sutherland

Abstract: The ProbCover method of Yehuda et al. is a well-motivated algorithm for active learning in low-budget regimes, which attempts to "cover" the data distribution with balls of a given radius at selected data points. We demonstrate, however, that the performance of this algorithm is extremely sensitive to the choice of this radius hyper-parameter, and that tuning it is quite difficult, with the original heuristic frequently failing. We thus introduce (and theoretically motivate) a generalized notion of "coverage," including ProbCover's objective as a special case, but also allowing smoother notions that are far more robust to hyper-parameter choice. We propose an efficient greedy method to optimize this coverage, generalizing ProbCover's algorithm; due to its close connection to kernel herding, we call it "MaxHerding." The objective can also be optimized non-greedily through a variant of $k$-medoids, clarifying the relationship to other low-budget active learning methods. In comprehensive experiments, MaxHerding surpasses existing active learning methods across multiple low-budget image classification benchmarks, and does so with less computational cost than most competitive methods.
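The hard-radius notion of coverage described above can be sketched as a greedy maximum-coverage loop: at each step, pick the point whose ball covers the most not-yet-covered points. This is a toy illustration on raw coordinates with hypothetical names (`greedy_ball_cover`, `radius`, `budget`). It is not the ProbCover or MaxHerding implementation, which operate on learned embeddings and, for MaxHerding, on a smoother kernel-based coverage.

```python
import math


def greedy_ball_cover(points, radius, budget):
    """Greedily select up to `budget` points whose radius-balls cover the most data.

    Returns the selected indices and the fraction of points covered.
    """
    n = len(points)
    # covers[i] holds the indices of all points within `radius` of point i.
    covers = [{j for j in range(n) if math.dist(points[i], points[j]) <= radius}
              for i in range(n)]
    covered = set()
    selected = []
    for _ in range(min(budget, n)):
        candidates = [i for i in range(n) if i not in selected]
        # Pick the candidate adding the most newly covered points (first wins ties).
        best = max(candidates, key=lambda i: len(covers[i] - covered))
        selected.append(best)
        covered |= covers[best]
    return selected, len(covered) / n


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (10.0, 10.0)]
selected_wide, coverage_wide = greedy_ball_cover(points, radius=0.5, budget=2)
selected_tight, coverage_tight = greedy_ball_cover(points, radius=0.05, budget=2)
```

With `radius=0.5` the two selected points cover five of the six points, while with `radius=0.05` each ball covers only its own center, so the same budget covers two of six. This gives a small-scale view of the radius sensitivity that the abstract reports.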

* Accepted to ECCV 2024

Learning Dynamics of LLM Finetuning

Jul 15, 2024

Yi Ren, Danica J. Sutherland

Abstract: Learning dynamics, which describe how the learning of specific training examples influences the model's predictions on other examples, give us a powerful tool for understanding the behavior of deep learning systems. We study the learning dynamics of large language models during finetuning, by analyzing the step-wise decomposition and accumulated influence among different responses. Our framework allows a uniform interpretation of many interesting observations about the training of popular algorithms for both instruction tuning and preference tuning. The analysis not only explains where the benefits of these methods come from but also inspires a simple, effective method to further improve the alignment performance. Code for experiments is available at https://github.com/Joshua-Ren/Learning_dynamics_LLM.

* 32 pages

Language Model Evolution: An Iterated Learning Perspective

Apr 04, 2024

Yi Ren, Shangmin Guo, Linlu Qiu, Bailin Wang, Danica J. Sutherland

Abstract: With the widespread adoption of Large Language Models (LLMs), the prevalence of iterative interactions among these models is anticipated to increase. Notably, recent advancements in multi-round self-improving methods allow LLMs to generate new examples for training subsequent models. At the same time, multi-agent LLM systems, involving automated interactions among agents, are also increasing in prominence. Thus, in both short and long terms, LLMs may actively engage in an evolutionary process. We draw parallels between the behavior of LLMs and the evolution of human culture, as the latter has been extensively studied by cognitive scientists for decades. Our approach involves leveraging Iterated Learning (IL), a Bayesian framework that elucidates how subtle biases are magnified during human cultural evolution, to explain some behaviors of LLMs. This paper outlines key characteristics of agents' behavior in the Bayesian-IL framework, including predictions that are supported by experimental verification with various LLMs. This theoretical framework could help to more effectively predict and guide the evolution of LLMs in desired directions.

Practical Kernel Tests of Conditional Independence

Feb 20, 2024

Roman Pogodin, Antonin Schrab, Yazhe Li, Danica J. Sutherland, Arthur Gretton

Abstract: We describe a data-efficient, kernel-based approach to statistical testing of conditional independence. A major challenge of conditional independence testing, absent in tests of unconditional independence, is to obtain the correct test level (the specified upper bound on the rate of false positives), while still attaining competitive test power. Excess false positives arise due to bias in the test statistic, which is obtained using nonparametric kernel ridge regression. We propose three methods for bias control to correct the test level, based on data splitting, auxiliary data, and (where possible) simpler function classes. We show these combined strategies are effective both for synthetic and real-world data.
