Steven Y. Feng

Michael Pokorny

A Unified Definition of Hallucination, Or: It's the World Model, Stupid

Dec 25, 2025

Emmy Liu, Varun Gangal, Chelsea Zou, Xiaoqi Huang, Michael Yu, Alex Chang, Zhuofu Tao, Sachin Kumar, Steven Y. Feng

Abstract:Despite numerous attempts to solve the issue of hallucination since the inception of neural language models, it remains a problem in even frontier large language models today. Why is this the case? We walk through definitions of hallucination used in the literature from a historical perspective up to the current day, and fold them into a single definition of hallucination, wherein different prior definitions focus on different aspects of our definition. At its core, we argue that hallucination is simply inaccurate (internal) world modeling, in a form where it is observable to the user (e.g., stating a fact which contradicts a knowledge base, or producing a summary which contradicts a known source). By varying the reference world model as well as the knowledge conflict policy (e.g., knowledge base vs. in-context), we arrive at the different existing definitions of hallucination present in the literature. We argue that this unified view is useful because it forces evaluations to make clear their assumed "world" or source of truth, clarifies what should and should not be called hallucination (as opposed to planning or reward/incentive-related errors), and provides a common language to compare benchmarks and mitigation techniques. Building on this definition, we outline plans for a family of benchmarks in which hallucinations are defined as mismatches with synthetic but fully specified world models in different environments, and sketch out how these benchmarks can use such settings to stress-test and improve the world modeling components of language models.

Humanity's Last Exam

Jan 24, 2025

Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra(+653 more)

Abstract:Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.

* 25 pages, 6 figures

Is Child-Directed Speech Effective Training Data for Language Models?

Aug 07, 2024

Steven Y. Feng, Noah D. Goodman, Michael C. Frank

Abstract:While high-performing language models are typically trained on hundreds of billions of words, human children become fluent language users with a much smaller amount of data. What are the features of the data they receive, and how do these features support language modeling objectives? To investigate this question, we train GPT-2 models on 29M words of English-language child-directed speech and a new matched, synthetic dataset (TinyDialogues), comparing to a heterogeneous blend of datasets from the BabyLM challenge. We evaluate both the syntactic and semantic knowledge of these models using developmentally-inspired evaluations. Through pretraining experiments, we test whether the global developmental ordering or the local discourse ordering of children's training data support high performance relative to other datasets. The local properties of the data affect model results, but somewhat surprisingly, global properties do not. Further, child language input is not uniquely valuable for training language models. These findings support the hypothesis that, rather than proceeding from better data, children's learning is instead substantially more efficient than current language modeling techniques.

* Preprint. Code and data will be released soon

The BabyView dataset: High-resolution egocentric videos of infants' and young children's everyday experiences

Jun 14, 2024

Bria Long, Violet Xiang, Stefan Stojanov, Robert Z. Sparks, Zi Yin, Grace E. Keene, Alvin W. M. Tan, Steven Y. Feng, Chengxu Zhuang, Virginia A. Marchman(+2 more)

Abstract:Human children far exceed modern machine learning algorithms in their sample efficiency, achieving high performance in key domains with much less data than current models. This "data gap" is a key challenge both for building intelligent artificial systems and for understanding human development. Egocentric video capturing children's experience -- their "training data" -- is a key ingredient for comparison of humans and models and for the development of algorithmic innovations to bridge this gap. Yet there are few such datasets available, and extant data are low-resolution, have limited metadata, and importantly, represent only a small set of children's experiences. Here, we provide the first release of the largest developmental egocentric video dataset to date -- the BabyView dataset -- recorded using a high-resolution camera with a large vertical field-of-view and gyroscope/accelerometer data. This 493-hour dataset includes egocentric videos from children spanning 6 months to 5 years of age in both longitudinal, at-home contexts and in a preschool environment. We provide gold-standard annotations for the evaluation of speech transcription, speaker diarization, and human pose estimation, and evaluate models in each of these domains. We train self-supervised language and vision models and evaluate their transfer to out-of-distribution tasks including syntactic structure learning, object recognition, depth estimation, and image segmentation. Although performance in each scales with dataset size, overall performance is relatively lower than when models are trained on curated datasets, especially in the visual domain. Our dataset stands as an open challenge for robust, humanlike AI systems: how can such systems achieve human levels of success on the same scale and distribution of training data as humans?

* 9 pages, 2 figures, 4 tables and SI. Submitted to NeurIPS Datasets and Benchmarks

CHARD: Clinical Health-Aware Reasoning Across Dimensions for Text Generation Models

Oct 09, 2022

Steven Y. Feng, Vivek Khetan, Bogdan Sacaleanu, Anatole Gershman, Eduard Hovy

Abstract:We motivate and introduce CHARD: Clinical Health-Aware Reasoning across Dimensions, to investigate the capability of text generation models to act as implicit clinical knowledge bases and generate free-flow textual explanations about various health-related conditions across several dimensions. We collect and present an associated dataset, CHARDat, consisting of explanations about 52 health conditions across three clinical dimensions. We conduct extensive experiments using BART and T5 along with data augmentation, and perform automatic, human, and qualitative analyses. We show that while our models can perform decently, CHARD is very challenging with strong potential for further exploration.

PINEAPPLE: Personifying INanimate Entities by Acquiring Parallel Personification data for Learning Enhanced generation

Sep 16, 2022

Sedrick Scott Keh, Kevin Lu, Varun Gangal, Steven Y. Feng, Harsh Jhamtani, Malihe Alikhani, Eduard Hovy

Abstract:A personification is a figure of speech that endows inanimate entities with properties and actions typically seen as requiring animacy. In this paper, we explore the task of personification generation. To this end, we propose PINEAPPLE: Personifying INanimate Entities by Acquiring Parallel Personification data for Learning Enhanced generation. We curate a corpus of personifications called PersonifCorp, together with automatically generated de-personified literalizations of these personifications. We demonstrate the usefulness of this parallel corpus by training a seq2seq model to personify a given literal input. Both automatic and human evaluations show that fine-tuning with PersonifCorp leads to significant gains in personification-related qualities such as animacy and interestingness. A detailed qualitative analysis also highlights key strengths and imperfections of PINEAPPLE over baselines, demonstrating a strong ability to generate diverse and creative personifications that enhance the overall appeal of a sentence.

* Accepted to COLING 2022; official Github repo at https://github.com/sedrickkeh/PINEAPPLE

PANCETTA: Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically

Sep 13, 2022

Sedrick Scott Keh, Steven Y. Feng, Varun Gangal, Malihe Alikhani, Eduard Hovy

Abstract:Tongue twisters are meaningful sentences that are difficult to pronounce. The process of automatically generating tongue twisters is challenging since the generated utterance must satisfy two conditions at once: phonetic difficulty and semantic meaning. Furthermore, phonetic difficulty is itself hard to characterize and is expressed in natural tongue twisters through a heterogeneous mix of phenomena such as alliteration and homophony. In this paper, we propose PANCETTA: Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically. We leverage phoneme representations to capture the notion of phonetic difficulty, and we train language models to generate original tongue twisters on two proposed task settings. To do this, we curate a dataset called PANCETTA, consisting of existing English tongue twisters. Through automatic and human evaluation, as well as qualitative analysis, we show that PANCETTA generates novel, phonetically difficult, fluent, and semantically meaningful tongue twisters.

* Official Github repo at https://github.com/sedrickkeh/PANCETTA

Retrieve, Caption, Generate: Visual Grounding for Enhancing Commonsense in Text Generation Models

Sep 08, 2021

Steven Y. Feng, Kevin Lu, Zhuofu Tao, Malihe Alikhani, Teruko Mitamura, Eduard Hovy, Varun Gangal

Abstract:We investigate the use of multimodal information contained in images as an effective method for enhancing the commonsense of Transformer models for text generation. We perform experiments using BART and T5 on concept-to-text generation, specifically the task of generative commonsense reasoning, or CommonGen. We call our approach VisCTG: Visually Grounded Concept-to-Text Generation. VisCTG involves captioning images representing appropriate everyday scenarios, and using these captions to enrich and steer the generation process. Comprehensive evaluation and analysis demonstrate that VisCTG noticeably improves model performance while successfully addressing several issues of the baseline generations, including poor commonsense, fluency, and specificity.

SAPPHIRE: Approaches for Enhanced Concept-to-Text Generation

Aug 15, 2021

Steven Y. Feng, Jessica Huynh, Chaitanya Narisetty, Eduard Hovy, Varun Gangal

Abstract:We motivate and propose a suite of simple but effective improvements for concept-to-text generation called SAPPHIRE: Set Augmentation and Post-hoc PHrase Infilling and REcombination. We demonstrate their effectiveness on generative commonsense reasoning, a.k.a. the CommonGen task, through experiments using both BART and T5 models. Through extensive automatic and human evaluation, we show that SAPPHIRE noticeably improves model performance. An in-depth qualitative analysis illustrates that SAPPHIRE effectively addresses many issues of the baseline model generations, including lack of commonsense, insufficient specificity, and poor fluency.

* INLG 2021. Code available at https://github.com/styfeng/SAPPHIRE

A Survey of Data Augmentation Approaches for NLP

May 29, 2021

Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, Eduard Hovy

Abstract:Data augmentation has recently seen increased interest in NLP due to more work in low-resource domains, new tasks, and the popularity of large-scale neural networks that require large amounts of training data. Despite this recent upsurge, this area is still relatively underexplored, perhaps due to the challenges posed by the discrete nature of language data. In this paper, we present a comprehensive and unifying survey of data augmentation for NLP by summarizing the literature in a structured manner. We first introduce and motivate data augmentation for NLP, and then discuss major methodologically representative approaches. Next, we highlight techniques that are used for popular NLP applications and tasks. We conclude by outlining current challenges and directions for future research. Overall, our paper aims to clarify the landscape of existing literature in data augmentation for NLP and motivate additional work in this area. We also present a GitHub repository with a paper list that will be continuously updated at https://github.com/styfeng/DataAug4NLP

* Accepted to ACL 2021 Findings. GitHub repo with paper list at https://github.com/styfeng/DataAug4NLP
