Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ameya Godbole

Hubble: a Model Suite to Advance the Study of LLM Memorization

Oct 22, 2025

Johnny Tian-Zheng Wei, Ameya Godbole, Mohammad Aflah Khan, Ryan Wang, Xiaoyuan Zhu, James Flemings, Nitya Kashyap, Krishna P. Gummadi, Willie Neiswanger, Robin Jia

Abstract:We present Hubble, a suite of fully open-source large language models (LLMs) for the scientific study of LLM memorization. Hubble models come in standard and perturbed variants: standard models are pretrained on a large English corpus, and perturbed models are trained in the same way but with controlled insertion of text (e.g., book passages, biographies, and test sets) designed to emulate key memorization risks. Our core release includes 8 models -- standard and perturbed models with 1B or 8B parameters, pretrained on 100B or 500B tokens -- establishing that memorization risks are determined by the frequency of sensitive data relative to size of the training corpus (i.e., a password appearing once in a smaller corpus is memorized better than the same password in a larger corpus). Our release also includes 6 perturbed models with text inserted at different pretraining phases, showing that sensitive data without continued exposure can be forgotten. These findings suggest two best practices for addressing memorization risks: to dilute sensitive data by increasing the size of the training corpus, and to order sensitive data to appear earlier in training. Beyond these general empirical findings, Hubble enables a broad range of memorization research; for example, analyzing the biographies reveals how readily different types of private information are memorized. We also demonstrate that the randomized insertions in Hubble make it an ideal testbed for membership inference and machine unlearning, and invite the community to further explore, benchmark, and build upon our work.

Via

Access Paper or Ask Questions

Interrogating LLM design under a fair learning doctrine

Feb 22, 2025

Johnny Tian-Zheng Wei, Maggie Wang, Ameya Godbole, Jonathan H. Choi, Robin Jia

Figure 1 for Interrogating LLM design under a fair learning doctrine

Figure 2 for Interrogating LLM design under a fair learning doctrine

Figure 3 for Interrogating LLM design under a fair learning doctrine

Figure 4 for Interrogating LLM design under a fair learning doctrine

Abstract:The current discourse on large language models (LLMs) and copyright largely takes a "behavioral" perspective, focusing on model outputs and evaluating whether they are substantially similar to training data. However, substantial similarity is difficult to define algorithmically and a narrow focus on model outputs is insufficient to address all copyright risks. In this interdisciplinary work, we take a complementary "structural" perspective and shift our focus to how LLMs are trained. We operationalize a notion of "fair learning" by measuring whether any training decision substantially affected the model's memorization. As a case study, we deconstruct Pythia, an open-source LLM, and demonstrate the use of causal and correlational analyses to make factual determinations about Pythia's training decisions. By proposing a legal standard for fair learning and connecting memorization analyses to this standard, we identify how judges may advance the goals of copyright law through adjudication. Finally, we discuss how a fair learning standard might evolve to enhance its clarity by becoming more rule-like and incorporating external technical guidelines.

Via

Access Paper or Ask Questions

Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics

Jan 24, 2025

Ameya Godbole, Robin Jia

Figure 1 for Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics

Figure 2 for Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics

Figure 3 for Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics

Figure 4 for Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics

Abstract:Improvements in large language models have led to increasing optimism that they can serve as reliable evaluators of natural language generation outputs. In this paper, we challenge this optimism by thoroughly re-evaluating five state-of-the-art factuality metrics on a collection of 11 datasets for summarization, retrieval-augmented generation, and question answering. We find that these evaluators are inconsistent with each other and often misestimate system-level performance, both of which can lead to a variety of pitfalls. We further show that these metrics exhibit biases against highly paraphrased outputs and outputs that draw upon faraway parts of the source documents. We urge users of these factuality metrics to proceed with caution and manually validate the reliability of these metrics in their domain of interest before proceeding.

Via

Access Paper or Ask Questions

Analysis of Plan-based Retrieval for Grounded Text Generation

Aug 20, 2024

Ameya Godbole, Nicholas Monath, Seungyeon Kim, Ankit Singh Rawat, Andrew McCallum, Manzil Zaheer

Figure 1 for Analysis of Plan-based Retrieval for Grounded Text Generation

Figure 2 for Analysis of Plan-based Retrieval for Grounded Text Generation

Figure 3 for Analysis of Plan-based Retrieval for Grounded Text Generation

Figure 4 for Analysis of Plan-based Retrieval for Grounded Text Generation

Abstract:In text generation, hallucinations refer to the generation of seemingly coherent text that contradicts established knowledge. One compelling hypothesis is that hallucinations occur when a language model is given a generation task outside its parametric knowledge (due to rarity, recency, domain, etc.). A common strategy to address this limitation is to infuse the language models with retrieval mechanisms, providing the model with relevant knowledge for the task. In this paper, we leverage the planning capabilities of instruction-tuned LLMs and analyze how planning can be used to guide retrieval to further reduce the frequency of hallucinations. We empirically evaluate several variations of our proposed approach on long-form text generation tasks. By improving the coverage of relevant facts, plan-guided retrieval and generation can produce more informative responses while providing a higher rate of attribution to source documents.

Via

Access Paper or Ask Questions

SCENE: Self-Labeled Counterfactuals for Extrapolating to Negative Examples

May 13, 2023

Deqing Fu, Ameya Godbole, Robin Jia

Abstract:Detecting negatives (such as non-entailment relationships, unanswerable questions, and false claims) is an important and challenging aspect of many natural language understanding tasks. Though manually collecting challenging negative examples can help models detect them, it is both costly and domain-specific. In this work, we propose Self-labeled Counterfactuals for Extrapolating to Negative Examples (SCENE), an automatic method for synthesizing training data that greatly improves models' ability to detect challenging negative examples. In contrast with standard data augmentation, which synthesizes new examples for existing labels, SCENE can synthesize negative examples zero-shot from only positive ones. Given a positive example, SCENE perturbs it with a mask infilling model, then determines whether the resulting example is negative based on a self-training heuristic. With access to only answerable training examples, SCENE can close 69.6% of the performance gap on SQuAD 2.0, a dataset where half of the evaluation examples are unanswerable, compared to a model trained on SQuAD 2.0. Our method also extends to boolean question answering and recognizing textual entailment, and improves generalization from SQuAD to ACE-whQA, an out-of-domain extractive QA benchmark.

Via

Access Paper or Ask Questions

Benchmarking Long-tail Generalization with Likelihood Splits

Oct 13, 2022

Ameya Godbole, Robin Jia

Figure 1 for Benchmarking Long-tail Generalization with Likelihood Splits

Figure 2 for Benchmarking Long-tail Generalization with Likelihood Splits

Figure 3 for Benchmarking Long-tail Generalization with Likelihood Splits

Figure 4 for Benchmarking Long-tail Generalization with Likelihood Splits

Abstract:In order to reliably process natural language, NLP systems must generalize to the long tail of rare utterances. We propose a method to create challenging benchmarks that require generalizing to the tail of the distribution by re-splitting existing datasets. We create 'Likelihood splits' where examples that are assigned lower likelihood by a pre-trained language model (LM) are placed in the test set, and more likely examples are in the training set. This simple approach can be customized to construct meaningful train-test splits for a wide range of tasks. Likelihood splits are more challenging than random splits: relative error rates of state-of-the-art models on our splits increase by 59% for semantic parsing on Spider, 77% for natural language inference on SNLI, and 38% for yes/no question answering on BoolQ compared with the corresponding random splits. Moreover, Likelihood splits create fairer benchmarks than adversarial filtering; when the LM used to create the splits is used as the task model, our splits do not adversely penalize the LM.

Via

Access Paper or Ask Questions

Knowledge Base Question Answering by Case-based Reasoning over Subgraphs

Feb 22, 2022

Rajarshi Das, Ameya Godbole, Ankita Naik, Elliot Tower, Robin Jia, Manzil Zaheer, Hannaneh Hajishirzi, Andrew McCallum

Figure 1 for Knowledge Base Question Answering by Case-based Reasoning over Subgraphs

Figure 2 for Knowledge Base Question Answering by Case-based Reasoning over Subgraphs

Figure 3 for Knowledge Base Question Answering by Case-based Reasoning over Subgraphs

Figure 4 for Knowledge Base Question Answering by Case-based Reasoning over Subgraphs

Abstract:Question answering (QA) over real-world knowledge bases (KBs) is challenging because of the diverse (essentially unbounded) types of reasoning patterns needed. However, we hypothesize in a large KB, reasoning patterns required to answer a query type reoccur for various entities in their respective subgraph neighborhoods. Leveraging this structural similarity between local neighborhoods of different subgraphs, we introduce a semiparametric model with (i) a nonparametric component that for each query, dynamically retrieves other similar $k$-nearest neighbor (KNN) training queries along with query-specific subgraphs and (ii) a parametric component that is trained to identify the (latent) reasoning patterns from the subgraphs of KNN queries and then apply it to the subgraph of the target query. We also propose a novel algorithm to select a query-specific compact subgraph from within the massive knowledge graph (KG), allowing us to scale to full Freebase KG containing billions of edges. We show that our model answers queries requiring complex reasoning patterns more effectively than existing KG completion algorithms. The proposed model outperforms or performs competitively with state-of-the-art models on several KBQA benchmarks.

Via

Access Paper or Ask Questions

Case-based Reasoning for Natural Language Queries over Knowledge Bases

Apr 18, 2021

Rajarshi Das, Manzil Zaheer, Dung Thai, Ameya Godbole, Ethan Perez, Jay-Yoon Lee, Lizhen Tan, Lazaros Polymenakos, Andrew McCallum

Figure 1 for Case-based Reasoning for Natural Language Queries over Knowledge Bases

Figure 2 for Case-based Reasoning for Natural Language Queries over Knowledge Bases

Figure 3 for Case-based Reasoning for Natural Language Queries over Knowledge Bases

Figure 4 for Case-based Reasoning for Natural Language Queries over Knowledge Bases

Abstract:It is often challenging for a system to solve a new complex problem from scratch, but much easier if the system can access other similar problems and description of their solutions -- a paradigm known as case-based reasoning (CBR). We propose a neuro-symbolic CBR approach for question answering over large knowledge bases (CBR-KBQA). While the idea of CBR is tempting, composing a solution from cases is nontrivial, when individual cases only contain partial logic to the full solution. To resolve this, CBR-KBQA consists of two modules: a non-parametric memory that stores cases (question and logical forms) and a parametric model which can generate logical forms by retrieving relevant cases from memory. Through experiments, we show that CBR-KBQA can effectively derive novel combination of relations not presented in case memory that is required to answer compositional questions. On several KBQA datasets that test compositional generalization, CBR-KBQA achieves competitive performance. For example, on the challenging ComplexWebQuestions dataset, CBR-KBQA outperforms the current state of the art by 11% accuracy. Furthermore, we show that CBR-KBQA is capable of using new cases \emph{without} any further training. Just by incorporating few human-labeled examples in the non-parametric case memory, CBR-KBQA is able to successfully generate queries containing unseen KB relations.

Via

Access Paper or Ask Questions

Probabilistic Case-based Reasoning for Open-World Knowledge Graph Completion

Oct 09, 2020

Rajarshi Das, Ameya Godbole, Nicholas Monath, Manzil Zaheer, Andrew McCallum

Figure 1 for Probabilistic Case-based Reasoning for Open-World Knowledge Graph Completion

Figure 2 for Probabilistic Case-based Reasoning for Open-World Knowledge Graph Completion

Figure 3 for Probabilistic Case-based Reasoning for Open-World Knowledge Graph Completion

Figure 4 for Probabilistic Case-based Reasoning for Open-World Knowledge Graph Completion

Abstract:A case-based reasoning (CBR) system solves a new problem by retrieving `cases' that are similar to the given problem. If such a system can achieve high accuracy, it is appealing owing to its simplicity, interpretability, and scalability. In this paper, we demonstrate that such a system is achievable for reasoning in knowledge-bases (KBs). Our approach predicts attributes for an entity by gathering reasoning paths from similar entities in the KB. Our probabilistic model estimates the likelihood that a path is effective at answering a query about the given entity. The parameters of our model can be efficiently computed using simple path statistics and require no iterative optimization. Our model is non-parametric, growing dynamically as new entities and relations are added to the KB. On several benchmark datasets our approach significantly outperforms other rule learning approaches and performs comparably to state-of-the-art embedding-based approaches. Furthermore, we demonstrate the effectiveness of our model in an "open-world" setting where new entities arrive in an online fashion, significantly outperforming state-of-the-art approaches and nearly matching the best offline method. Code available at https://github.com/ameyagodbole/Prob-CBR

Via

Access Paper or Ask Questions

A Simple Approach to Case-Based Reasoning in Knowledge Bases

Jun 25, 2020

Rajarshi Das, Ameya Godbole, Shehzaad Dhuliawala, Manzil Zaheer, Andrew McCallum

Figure 1 for A Simple Approach to Case-Based Reasoning in Knowledge Bases

Figure 2 for A Simple Approach to Case-Based Reasoning in Knowledge Bases

Figure 3 for A Simple Approach to Case-Based Reasoning in Knowledge Bases

Figure 4 for A Simple Approach to Case-Based Reasoning in Knowledge Bases

Abstract:We present a surprisingly simple yet accurate approach to reasoning in knowledge graphs (KGs) that requires \emph{no training}, and is reminiscent of case-based reasoning in classical artificial intelligence (AI). Consider the task of finding a target entity given a source entity and a binary relation. Our non-parametric approach derives crisp logical rules for each query by finding multiple \textit{graph path patterns} that connect similar source entities through the given relation. Using our method, we obtain new state-of-the-art accuracy, outperforming all previous models, on NELL-995 and FB-122. We also demonstrate that our model is robust in low data settings, outperforming recently proposed meta-learning approaches

Via

Access Paper or Ask Questions