Abstract: In recent years, Text-to-SQL, the problem of automatically converting questions posed in natural language into formal SQL queries, has emerged as an important problem at the intersection of natural language processing and data management research. Large language models (LLMs) have delivered impressive performance when used off-the-shelf, but still fall significantly short of expert-level performance. Errors are especially likely when a nuanced understanding of database schemas, questions, and SQL clauses is needed for correct Text-to-SQL conversion. We introduce SelECT-SQL, a novel in-context learning solution that uses an algorithmic combination of chain-of-thought (CoT) prompting, self-correction, and ensemble methods to yield a new state-of-the-art result on challenging Text-to-SQL benchmarks. Specifically, when configured using GPT-3.5-Turbo as the base LLM, SelECT-SQL achieves 84.2% execution accuracy on the Spider leaderboard's development set, exceeding both the best result of other GPT-3.5-Turbo-based baselines (81.1%) and the peak performance (83.5%) of the GPT-4 result reported on the leaderboard.
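The following sketch illustrates, at a high level, how chain-of-thought prompting, self-correction, and ensembling can be combined in a single Text-to-SQL loop. The ask_llm placeholder and the prompt wording are illustrative assumptions, not SelECT-SQL's actual prompts or algorithm.

    # Minimal sketch of a CoT + self-correction + ensemble Text-to-SQL loop.
    # ask_llm and the prompt wording are illustrative placeholders, not the
    # exact prompts used by SelECT-SQL.
    from collections import Counter

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("call GPT-3.5-Turbo or another base LLM here")

    def generate_sql(question: str, schema: str, n_samples: int = 5) -> str:
        candidates = []
        for _ in range(n_samples):
            # Step 1: chain-of-thought draft.
            draft = ask_llm(
                f"Schema:\n{schema}\nQuestion: {question}\n"
                "Reason step by step about the tables, joins, and clauses needed, "
                "then write the SQL query."
            )
            # Step 2: self-correction pass over the draft.
            fixed = ask_llm(
                f"Schema:\n{schema}\nQuestion: {question}\nCandidate SQL:\n{draft}\n"
                "Check the query for schema mismatches or clause errors and "
                "return a corrected query."
            )
            candidates.append(fixed.strip())
        # Step 3: ensemble the sampled candidates by majority vote.
        return Counter(candidates).most_common(1)[0][0]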
Abstract: Despite their impressive performance, large language models (LLMs) such as ChatGPT are known to pose important risks. One such set of risks arises from misplaced confidence, whether over-confidence or under-confidence, in the models' inferences. While the former is well studied, the latter is not, leading to an asymmetry in understanding the comprehensive risk posed by misplaced confidence. In this paper, we address this asymmetry by defining two types of risk (decision and composite risk), and proposing an experimental framework consisting of a two-level inference architecture and appropriate metrics for measuring such risks in both discriminative and generative LLMs. The first level relies on a decision rule that determines whether the underlying language model should abstain from inference. The second level (which applies if the model does not abstain) is the model's inference itself. Detailed experiments on four natural language commonsense reasoning datasets, using both an open-source ensemble-based RoBERTa model and ChatGPT, demonstrate the practical utility of the evaluation framework. For example, our results show that our framework can get an LLM to confidently respond to an extra 20.1% of low-risk inference tasks that other methods might misclassify as high-risk, and to skip 19.8% of high-risk tasks that would have been answered incorrectly.
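A minimal sketch of the two-level architecture follows, assuming a simple confidence-threshold decision rule as a stand-in for the rule actually learned and evaluated in the paper: level one decides whether to abstain, and level two returns the model's inference only if it does not.

    # Level 1: a decision rule that may abstain; Level 2: the model's inference.
    # The fixed confidence threshold is an illustrative stand-in, not the
    # decision rule studied in the paper.
    from typing import Optional, Sequence

    def two_level_inference(question: str,
                            choices: Sequence[str],
                            score_fn,               # returns one confidence per choice
                            threshold: float = 0.6) -> Optional[str]:
        scores = score_fn(question, choices)
        best = max(range(len(choices)), key=lambda i: scores[i])
        if scores[best] < threshold:
            return None                      # Level 1: abstain
        return choices[best]                 # Level 2: answer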
Abstract: Spatial reasoning, an important faculty of human cognition with many practical applications, is one of the core commonsense skills that is not purely language-based and, for satisfying (as opposed to optimal) solutions, requires some minimum degree of planning. Existing benchmarks of Commonsense Spatial Reasoning (CSR) tend to evaluate how Large Language Models (LLMs) interpret text-based spatial descriptions rather than directly evaluate a plan produced by the LLM in response to a spatial reasoning scenario. In this paper, we construct a large-scale benchmark called GRASP, which consists of 16,000 grid-based environments where the agent is tasked with an energy collection problem. These environments include 100 grid instances instantiated using each of the 160 different grid settings, involving five different energy distributions, two modes of agent starting position, and two distinct obstacle configurations, as well as three kinds of agent constraints. Using GRASP, we compare classic baseline approaches, such as random walk and greedy search methods, with advanced LLMs like GPT-3.5-Turbo and GPT-4o. The experimental results indicate that even these advanced LLMs struggle to consistently achieve satisfactory solutions.
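To make the task concrete, here is a minimal sketch of a grid-based energy-collection episode with a greedy baseline, in the spirit of GRASP. The grid size, cell encoding, step budget, and the decision to ignore obstacles during movement are simplifying assumptions, not the benchmark's exact specification.

    import random

    def make_grid(size=10, n_energy=20, n_obstacles=10):
        grid = [[0] * size for _ in range(size)]
        cells = random.sample([(r, c) for r in range(size) for c in range(size)],
                              n_energy + n_obstacles)
        for r, c in cells[:n_energy]:
            grid[r][c] = 1        # energy token
        for r, c in cells[n_energy:]:
            grid[r][c] = -1       # obstacle
        return grid

    def greedy_episode(grid, start=(0, 0), steps=40):
        """Greedy baseline: step toward the nearest remaining energy token.
        Obstacles are ignored here for brevity."""
        r, c = start
        collected = 0
        for _ in range(steps):
            targets = [(i, j) for i, row in enumerate(grid)
                       for j, v in enumerate(row) if v == 1]
            if not targets:
                break
            tr, tc = min(targets, key=lambda t: abs(t[0] - r) + abs(t[1] - c))
            r += (tr > r) - (tr < r)
            c += (tc > c) - (tc < c)
            if grid[r][c] == 1:
                grid[r][c] = 0
                collected += 1
        return collected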
Abstract: Personality, a fundamental aspect of human cognition, encompasses a range of traits that influence behaviors, thoughts, and emotions. This paper explores the capabilities of large language models (LLMs) in reconstructing these complex cognitive attributes based only on simple descriptions containing socio-demographic and personality type information. Utilizing the HEXACO personality framework, our study examines the consistency of LLMs in recovering and predicting underlying (latent) personality dimensions from simple descriptions. Our experiments reveal a significant degree of consistency in personality reconstruction, although some inconsistencies and biases, such as a tendency to default to positive traits in the absence of explicit information, are also observed. Additionally, socio-demographic factors like age and number of children were found to influence the reconstructed personality dimensions. These findings have implications for building sophisticated agent-based simulacra using LLMs and highlight the need for further research on robust personality generation in LLMs.
Abstract: Words of estimative probability (WEPs), such as "maybe" or "probably not", are ubiquitous in natural language for communicating estimative uncertainty, in contrast to direct statements involving numerical probability. Human estimative uncertainty, and its calibration with numerical estimates, has long been an area of study, including by intelligence agencies like the CIA. This study compares estimative uncertainty in commonly used large language models (LLMs) like GPT-4 and ERNIE-4 to that of humans, and to each other. Here we show that LLMs like GPT-3.5 and GPT-4 align with human estimates for some, but not all, WEPs presented in English. Divergence is also observed when the LLM is presented with gendered roles and Chinese contexts. Further study shows that an advanced LLM like GPT-4 can consistently map between statistical and estimative uncertainty, but a significant performance gap remains. The results contribute to a growing body of research on human-LLM alignment.
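As an illustration of the comparison, the sketch below elicits a numerical probability for each WEP from an LLM and measures the gap against a human reference estimate. The ask_llm placeholder, the WEP list, and the empty reference table are assumptions to be filled with the study's own data.

    WEPS = ["almost certainly", "probably", "maybe", "probably not", "almost no chance"]
    HUMAN_REFERENCE = {w: None for w in WEPS}   # fill with human mean estimates (0-100)

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("query GPT-4, ERNIE-4, or another LLM here")

    def elicit_llm_estimates():
        estimates = {}
        for wep in WEPS:
            reply = ask_llm(
                f"On a scale of 0 to 100, what probability does the phrase "
                f"'{wep}' typically express? Answer with a single number."
            )
            estimates[wep] = float(reply.strip().rstrip("%"))
        return estimates

    def mean_absolute_gap(llm_estimates, human_reference):
        pairs = [(llm_estimates[w], human_reference[w])
                 for w in WEPS if human_reference[w] is not None]
        return sum(abs(a - b) for a, b in pairs) / len(pairs)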
Abstract: Artificial Intelligence (AI) systems, trained in controlled environments, often struggle in real-world complexities. We propose a general framework for estimating domain complexity across diverse environments, like open-world learning and real-world applications. This framework distinguishes between intrinsic complexity (inherent to the domain) and extrinsic complexity (dependent on the AI agent). By analyzing dimensionality, sparsity, and diversity within these categories, we offer a comprehensive view of domain challenges. This approach enables quantitative predictions of AI difficulty during environment transitions, avoids bias in novel situations, and helps navigate the vast search spaces of open-world domains.
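A minimal sketch of profiling a domain along the three axes mentioned above, assuming the domain is summarized as a matrix of observed state vectors. The proxies used (feature count, fraction of zero entries, fraction of distinct states) are illustrative stand-ins, not the framework's formal definitions of dimensionality, sparsity, and diversity.

    import numpy as np

    def complexity_profile(states: np.ndarray) -> dict:
        """states: an (n_observations, n_features) array sampled from the domain."""
        return {
            "dimensionality": states.shape[1],
            "sparsity": float(np.mean(states == 0)),
            "diversity": len({tuple(row) for row in states}) / states.shape[0],
        }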
Abstract: Recent progress in generative AI, including large language models (LLMs) like ChatGPT, has opened up significant opportunities in fields ranging from natural language processing to knowledge discovery and data mining. However, there is also a growing awareness that the models can be prone to problems such as making information up, or 'hallucinations', and faulty reasoning on seemingly simple problems. Because of the popularity of models like ChatGPT, both academic scholars and citizen scientists have documented hallucinations of several different types and severities. Despite this body of work, a formal model for describing and representing these hallucinations (with relevant metadata) at a fine-grained level is still lacking. In this paper, we address this gap by presenting the Hallucination Ontology, or HALO, a formal, extensible ontology written in OWL that currently offers support for six different types of hallucinations known to arise in LLMs, along with support for provenance and experimental metadata. We also collect and publish a dataset containing hallucinations that we inductively gathered across multiple independent Web sources, and show that HALO can be successfully used to model this dataset and answer competency questions.
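The sketch below shows how a single hallucination record could be asserted against an OWL ontology like HALO using rdflib. The namespace IRI, class name, and properties are hypothetical placeholders, not HALO's actual vocabulary.

    from rdflib import Graph, Literal, Namespace, RDF, URIRef

    HALO = Namespace("https://example.org/halo#")    # placeholder namespace

    g = Graph()
    g.bind("halo", HALO)

    record = URIRef("https://example.org/halo#hallucination_001")
    g.add((record, RDF.type, HALO.FactualHallucination))    # hypothetical class
    g.add((record, HALO.generatedBy, Literal("ChatGPT")))   # hypothetical property
    g.add((record, HALO.sourceURL, Literal("https://example.org/source-post")))
    g.add((record, HALO.promptText, Literal("Prompt that triggered the hallucination")))

    print(g.serialize(format="turtle"))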
Abstract: Entity Resolution (ER) is the problem of semi-automatically determining when two representations refer to the same underlying entity, with applications ranging from healthcare to e-commerce. Traditional ER solutions required considerable manual expertise, including feature engineering, as well as identification and curation of training data. In many instances, such techniques are highly dependent on the domain. With the recent advent of large language models (LLMs), there is an opportunity to make ER much more seamless and domain-independent. However, it is also well known that LLMs can pose risks, and that the quality of their outputs can depend on so-called prompt engineering. Unfortunately, a systematic experimental study on the effects of different prompting methods for addressing ER, using LLMs like ChatGPT, has been lacking thus far. This paper aims to address this gap by conducting such a study. Although preliminary in nature, our results show that prompting can significantly affect the quality of ER, though it affects some metrics more than others, and that the effects can also be dataset-dependent.
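To make the notion of a prompting method concrete, the sketch below contrasts a zero-shot and a few-shot matching prompt for a record pair. The wording is illustrative and is not the exact set of prompting methods evaluated in the paper.

    def zero_shot_prompt(record_a: dict, record_b: dict) -> str:
        return (
            "Do the following two records refer to the same real-world entity? "
            "Answer Yes or No.\n"
            f"Record A: {record_a}\nRecord B: {record_b}"
        )

    def few_shot_prompt(record_a: dict, record_b: dict, examples: list) -> str:
        shots = "\n".join(
            f"Record A: {a}\nRecord B: {b}\nAnswer: {label}"
            for a, b, label in examples
        )
        return (
            "Decide whether each pair of records refers to the same entity.\n"
            f"{shots}\n"
            f"Record A: {record_a}\nRecord B: {record_b}\nAnswer:"
        )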
Abstract: Efficiently finding doctors and locations is an important search problem for patients in the healthcare domain, for which traditional information retrieval methods tend not to work optimally. In the last ten years, knowledge graphs (KGs) have emerged as a powerful way to combine the benefits of gleaning insights from semi-structured data using semantic modeling, natural language processing techniques like information extraction, and robust querying using structured query languages like SPARQL and Cypher. In this short paper, we present a KG-based search engine architecture for robustly finding doctors and locations in the healthcare domain. Early results demonstrate that our approach can lead to significantly higher coverage for complex queries without degrading quality.
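As an illustration, the sketch below runs the kind of structured query such a KG-backed engine might issue for a specialty-plus-location search. The endpoint URL and schema terms (ex:Doctor, ex:specialty, ex:practicesAt, ex:city) are hypothetical placeholders, not the production graph's vocabulary.

    from SPARQLWrapper import SPARQLWrapper, JSON

    QUERY = """
    PREFIX ex: <https://example.org/healthcare#>
    SELECT ?doctor ?location WHERE {
      ?doctor a ex:Doctor ;
              ex:specialty "Cardiology" ;
              ex:practicesAt ?location .
      ?location ex:city "Santa Monica" .
    }
    LIMIT 25
    """

    def find_doctors(endpoint_url: str):
        client = SPARQLWrapper(endpoint_url)
        client.setQuery(QUERY)
        client.setReturnFormat(JSON)
        results = client.query().convert()
        return [(b["doctor"]["value"], b["location"]["value"])
                for b in results["results"]["bindings"]]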
Abstract: Large Language Models (LLMs), such as ChatGPT, have achieved impressive milestones in natural language processing (NLP). Despite their impressive performance, the models are known to pose important risks. As these models are deployed in real-world applications, a systematic understanding of the different risks they pose on tasks such as natural language inference (NLI) is much needed. In this paper, we define and formalize two distinct types of risk: decision risk and composite risk. We also propose a risk-centric evaluation framework, and four novel metrics, for assessing LLMs on these risks in both in-domain and out-of-domain settings. Finally, we propose a risk-adjusted calibration method called DwD for helping LLMs minimize these risks in an overall NLI architecture. Detailed experiments, using four NLI benchmarks, three baselines and two LLMs, including ChatGPT, show both the practical utility of the evaluation framework and the efficacy of DwD in reducing decision and composite risk. For instance, when using DwD, an underlying LLM is able to address an extra 20.1% of low-risk inference tasks (which the LLM would erroneously deem high-risk without risk adjustment) and to skip a further 19.8% of high-risk tasks that would have been answered incorrectly.
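The sketch below tallies the two kinds of error that motivate decision and composite risk: answering when the inference would be wrong, and abstaining when it would have been right. These simple rates are illustrative proxies, not the paper's formal risk metrics or the DwD calibration method.

    def risk_tally(records):
        """records: list of (abstained: bool, would_be_correct: bool) pairs."""
        records = list(records)
        answered_wrong = sum(1 for a, c in records if not a and not c)
        abstained_right = sum(1 for a, c in records if a and c)
        n = len(records)
        return {
            "answered_but_wrong_rate": answered_wrong / n,
            "abstained_but_right_rate": abstained_right / n,
        }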