Hangfeng He

Replicating Human Motivated Reasoning Studies with LLMs

Jan 22, 2026

Neeley Pate, Adiba Mahbub Proma, Hangfeng He, James N. Druckman, Daniel Molden, Gourab Ghoshal, Ehsan Hoque

Abstract: Motivated reasoning -- the idea that individuals processing information may be motivated to reach a certain conclusion, whether it be accurate or predetermined -- has been well-explored as a human phenomenon. However, it is unclear whether base LLMs mimic these motivational changes. Replicating four prior political motivated reasoning studies, we find that base LLM behavior does not align with expected human behavior. Furthermore, base LLM behavior across models shares some similarities, such as smaller standard deviations and inaccurate argument strength assessments. We emphasize the importance of these findings for researchers using LLMs to automate tasks such as survey data collection and argument assessment.

Mitigating Hallucinations in Multimodal Spatial Relations through Constraint-Aware Prompting

Feb 12, 2025

Jiarui Wu, Zhuo Liu, Hangfeng He

Abstract: Spatial relation hallucinations pose a persistent challenge in large vision-language models (LVLMs), leading them to generate incorrect predictions about object positions and spatial configurations within an image. To address this issue, we propose a constraint-aware prompting framework designed to reduce spatial relation hallucinations. Specifically, we introduce two types of constraints: (1) a bidirectional constraint, which ensures consistency in pairwise object relations, and (2) a transitivity constraint, which enforces relational dependence across multiple objects. By incorporating these constraints, LVLMs can produce more spatially coherent and consistent outputs. We evaluate our method on three widely used spatial relation datasets, demonstrating performance improvements over existing approaches. Additionally, a systematic analysis of various bidirectional relation analysis choices and transitivity reference selections highlights the broader potential of our method for incorporating constraints to mitigate spatial relation hallucinations.

* 19 pages, accepted to NAACL Findings
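The two constraint types named in the abstract can be illustrated with a small consistency checker. This is only a sketch under stated assumptions: the abstract describes a prompting framework, whereas the code below is a post-hoc check over relations a model has already produced. The relation vocabulary (`left`, `right`, `above`, `below`) and the function names are hypothetical and are not taken from the paper.

```python
# Hypothetical relation vocabulary; each relation maps to its inverse.
INVERSE = {"left": "right", "right": "left", "above": "below", "below": "above"}


def bidirectional_violations(relations):
    """Return pairs (a, b) whose reverse answer (b, a) is not the inverse relation.

    relations maps (a, b) to the predicted relation of a with respect to b.
    """
    violations = []
    for (a, b), rel in relations.items():
        reverse = relations.get((b, a))
        if reverse is not None and reverse != INVERSE[rel]:
            violations.append((a, b))
    return violations


def transitivity_violations(relations):
    """Return triples (a, b, c) where a-b and b-c share a relation but a-c differs."""
    violations = []
    for (a, b), r1 in relations.items():
        for (b2, c), r2 in relations.items():
            if b2 == b and c != a and r1 == r2:
                implied = relations.get((a, c))
                if implied is not None and implied != r1:
                    violations.append((a, b, c))
    return violations


if __name__ == "__main__":
    answers = {
        ("cup", "plate"): "left",
        ("plate", "cup"): "left",   # contradicts the first answer
    }
    print(bidirectional_violations(answers))
```

A checker like this could be used to flag answers that need re-prompting; how the paper itself injects the constraints into prompts is not specified in the abstract.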

Same Company, Same Signal: The Role of Identity in Earnings Call Transcripts

Dec 23, 2024

Ding Yu, Zhuo Liu, Hangfeng He

Abstract: Post-earnings volatility prediction is critical for investors, with previous works often leveraging earnings call transcripts under the assumption that their rich semantics contribute significantly. To further investigate how transcripts impact volatility, we introduce DEC, a dataset featuring accurate volatility calculations enabled by the previously overlooked beforeAfterMarket attribute and dense ticker coverage. Unlike established benchmarks, where each ticker has only around two earnings, DEC provides 20 earnings records per ticker. Using DEC, we reveal that post-earnings volatility undergoes significant shifts, with each ticker displaying a distinct volatility distribution. To leverage historical post-earnings volatility and capture ticker-specific patterns, we propose two training-free baselines: Post-earnings Volatility (PEV) and Same-ticker Post-earnings Volatility (STPEV). These baselines surpass all transcript-based models on DEC as well as on established benchmarks. Additionally, we demonstrate that current transcript representations predominantly capture ticker identity rather than offering financially meaningful insights specific to each earnings. This is evidenced by two key observations: earnings representations from the same ticker exhibit significantly higher similarity compared to those from different tickers, and predictions from transcript-based models show strong correlations with prior post-earnings volatility.
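To make the idea of a training-free, history-based baseline concrete, here is a minimal sketch. The abstract does not define PEV or STPEV precisely, so the definitions below are assumptions: PEV is taken to be the mean of all historical post-earnings volatilities, and STPEV the mean over the same ticker's history, falling back to the global mean for unseen tickers. All function names are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def fit_baselines(history):
    """Summarize historical post-earnings volatility.

    history is an iterable of (ticker, post_earnings_volatility) pairs.
    Returns (global_mean, per_ticker_means).
    """
    by_ticker = defaultdict(list)
    for ticker, vol in history:
        by_ticker[ticker].append(vol)
    all_vols = [v for vols in by_ticker.values() for v in vols]
    global_mean = fmean(all_vols)
    ticker_means = {t: fmean(vols) for t, vols in by_ticker.items()}
    return global_mean, ticker_means


def predict_pev(global_mean):
    """Assumed PEV: predict the global historical mean for every ticker."""
    return global_mean


def predict_stpev(global_mean, ticker_means, ticker):
    """Assumed STPEV: predict the ticker's own historical mean if available."""
    return ticker_means.get(ticker, global_mean)


if __name__ == "__main__":
    past = [("AAA", 0.2), ("AAA", 0.4), ("BBB", 0.1)]
    g, per_ticker = fit_baselines(past)
    print(predict_pev(g), predict_stpev(g, per_ticker, "AAA"))
```

Neither function looks at a transcript, which is the point of such a baseline: any transcript-based model should be compared against what ticker history alone already predicts.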

On the Role of Model Prior in Real-World Inductive Reasoning

Dec 18, 2024

Zhuo Liu, Ding Yu, Hangfeng He

Abstract: Large Language Models (LLMs) show impressive inductive reasoning capabilities, enabling them to generate hypotheses that could generalize effectively to new instances when guided by in-context demonstrations. However, in real-world applications, LLMs' hypothesis generation is not solely determined by these demonstrations but is significantly shaped by task-specific model priors. Despite their critical influence, the distinct contributions of model priors versus demonstrations to hypothesis generation have been underexplored. This study bridges this gap by systematically evaluating three inductive reasoning strategies across five real-world tasks with three LLMs. Our empirical findings reveal that hypothesis generation is primarily driven by the model's inherent priors; removing demonstrations results in minimal loss of hypothesis quality and downstream usage. Further analysis shows that the result is consistent across various label formats with different label configurations, and that the prior is hard to override, even under flipped labeling. These insights advance our understanding of the dynamics of hypothesis generation in LLMs and highlight the potential for better utilizing model priors in real-world inductive reasoning tasks.

MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models

Oct 13, 2024

Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, Jiebo Luo

Abstract: The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling more sophisticated and accurate integration of visual and textual information across various tasks, including image and video captioning, visual question answering, and cross-modal retrieval. Despite VLMs' superior capabilities, researchers lack a comprehensive understanding of their compositionality -- the ability to understand and produce novel combinations of known visual and textual components. Prior benchmarks provide only a relatively rough compositionality evaluation from the perspectives of objects, relations, and attributes while neglecting deeper reasoning about object interactions, counting, and complex compositions. However, compositionality is a critical ability that facilitates coherent reasoning and understanding across modalities for VLMs. To address this limitation, we propose MMCOMPOSITION, a novel human-annotated benchmark for comprehensively and accurately evaluating VLMs' compositionality. Our proposed benchmark serves as a complement to these earlier works. With MMCOMPOSITION, we can quantify and explore the compositionality of the mainstream VLMs. Surprisingly, we find GPT-4o's compositionality inferior to the best open-source model, and we analyze the underlying reasons. Our experimental analysis reveals the limitations of VLMs in fine-grained compositional perception and reasoning, and points to areas for improvement in VLM design and training. Resources available at: https://hanghuacs.github.io/MMComposition/

* 21 pages, 15 figures

A Law of Next-Token Prediction in Large Language Models

Aug 24, 2024

Hangfeng He, Weijie J. Su

Abstract: Large language models (LLMs) have been widely employed across various application domains, yet their black-box nature poses significant challenges to understanding how these models process input data internally to make predictions. In this paper, we introduce a precise and quantitative law that governs the learning of contextualized token embeddings through intermediate layers in pre-trained LLMs for next-token prediction. Our findings reveal that each layer contributes equally to enhancing prediction accuracy, from the lowest to the highest layer -- a universal phenomenon observed across a diverse array of open-source LLMs, built on architectures such as Transformer, RWKV, and Mamba. We demonstrate that this law offers new perspectives and insights to inform and guide practices in LLM development and applications, including model scaling, pre-training tasks, and information flow. Overall, our law enables more fine-grained approaches to the design, training, and interpretation of LLMs through scrutinizing their internal data processing mechanisms.

An Empirical Analysis on Large Language Models in Debate Evaluation

Jun 04, 2024

Xinyi Liu, Pinxin Liu, Hangfeng He

Abstract: In this study, we investigate the capabilities and inherent biases of advanced large language models (LLMs) such as GPT-3.5 and GPT-4 in the context of debate evaluation. We discover that LLMs' performance exceeds that of humans and surpasses the performance of state-of-the-art methods fine-tuned on extensive datasets in debate evaluation. We additionally explore and analyze biases present in LLMs, including positional bias, lexical bias, and order bias, which may affect their evaluative judgments. Our findings reveal a consistent bias in both GPT-3.5 and GPT-4 towards the second candidate response presented, attributed to prompt design. We also uncover lexical biases in both GPT-3.5 and GPT-4, especially when label sets carry connotations such as numerical or sequential, highlighting the critical need for careful label verbalizer selection in prompt design. Additionally, our analysis indicates a tendency of both models to favor the debate's concluding side as the winner, suggesting an end-of-discussion bias.

* Accepted to ACL 2024 main
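One common way to probe a positional bias like the one described above is a swap test: present each pair of candidates in both orders and check whether the verdict follows the candidate or the slot. The sketch below assumes a hypothetical `judge(first, second)` callable that returns `0` if the first-presented candidate wins and `1` if the second does. It is not the paper's evaluation protocol, and no real LLM API is called.

```python
def position_consistency(judge, pairs):
    """Run a swap test over candidate pairs.

    judge(first, second) must return 0 (first slot wins) or 1 (second slot wins).
    Returns (consistency_rate, second_slot_win_rate).
    """
    consistent = 0
    second_slot_wins = 0
    for a, b in pairs:
        original = judge(a, b)
        swapped = judge(b, a)
        second_slot_wins += original + swapped
        # The same candidate wins in both orders only if the winning slot flips.
        if original != swapped:
            consistent += 1
    n = len(pairs)
    return consistent / n, second_slot_wins / (2 * n)


if __name__ == "__main__":
    # A toy judge that always prefers whatever is shown second.
    always_second = lambda first, second: 1
    print(position_consistency(always_second, [("pro", "con")]))
```

A second-slot win rate well above 0.5 together with a low consistency rate would suggest that verdicts track presentation order more than content.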

Unveiling Divergent Inductive Biases of LLMs on Temporal Data

Apr 01, 2024

Sindhu Kishore, Hangfeng He

Abstract: Unraveling the intricate details of events in natural language necessitates a subtle understanding of temporal dynamics. Despite the adeptness of Large Language Models (LLMs) in discerning patterns and relationships from data, their inherent comprehension of temporal dynamics remains a formidable challenge. This research meticulously explores these intrinsic challenges within LLMs, with a specific emphasis on evaluating the performance of GPT-3.5 and GPT-4 models in the analysis of temporal data. Employing two distinct prompt types, namely Question Answering (QA) format and Textual Entailment (TE) format, our analysis probes into both implicit and explicit events. The findings underscore noteworthy trends, revealing disparities in the performance of GPT-3.5 and GPT-4. Notably, biases toward specific temporal relationships come to light, with GPT-3.5 demonstrating a preference for "AFTER" in the QA format for both implicit and explicit events, while GPT-4 leans towards "BEFORE". Furthermore, a consistent pattern surfaces wherein GPT-3.5 tends towards "TRUE", and GPT-4 exhibits a preference for "FALSE" in the TE format for both implicit and explicit events. This persistent discrepancy between GPT-3.5 and GPT-4 in handling temporal data highlights the intricate nature of inductive bias in LLMs, suggesting that the evolution of these models may not merely mitigate bias but may introduce new layers of complexity.

SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation

Sep 29, 2023

Hangfeng He, Hongming Zhang, Dan Roth

Abstract: To comprehensively assess the capacity of current models for complex reasoning, it is crucial to assess their step-by-step reasoning in a scalable manner. Established reference-based evaluation metrics rely on human-annotated reasoning chains to assess the model-derived chains. However, such "gold-standard" human-written reasoning chains may not be unique and their acquisition is often labor-intensive. Existing reference-free reasoning metrics eliminate the need for human-crafted reasoning chains as references, but they typically require fine-tuning on datasets with human-derived reasoning chains, which complicates the process and raises concerns regarding generalizability across diverse datasets. To address these challenges, we harness GPT-4 to automatically evaluate reasoning chain quality, obviating the need for human-crafted references. Leveraging the Socratic method, we devise tailored prompts to enhance reference-free reasoning evaluation, which we term SocREval (Socratic method for Reasoning Evaluation). Empirical results from four human-annotated datasets reveal that SocREval significantly improves GPT-4's performance, surpassing existing reference-free and reference-based reasoning evaluation metrics. Beyond its demonstrated efficacy, our proposed framework, large language models (LLMs) with the Socratic method, proves to be both cost-efficient and robust to prompt writing and example selection, as substantiated by our in-depth analysis.

On Regularization and Inference with Label Constraints

Jul 08, 2023

Kaifu Wang, Hangfeng He, Tin D. Nguyen, Piyush Kumar, Dan Roth

Abstract: Prior knowledge and symbolic rules in machine learning are often expressed in the form of label constraints, especially in structured prediction problems. In this work, we compare two common strategies for encoding label constraints in a machine learning pipeline, regularization with constraints and constrained inference, by quantifying their impact on model performance. For regularization, we show that it narrows the generalization gap by precluding models that are inconsistent with the constraints. However, its preference for small violations introduces a bias toward a suboptimal model. For constrained inference, we show that it reduces the population risk by correcting a model's violation, and hence turns the violation into an advantage. Given these differences, we further explore the use of the two approaches together and propose conditions for constrained inference to compensate for the bias introduced by regularization, aiming to improve both the model complexity and optimal risk.
