Avinash Baidya

The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs

Jun 13, 2025

Avinash Baidya, Kamalika Das, Xiang Gao

Abstract: Large Language Model (LLM)-based agents have significantly impacted Task-Oriented Dialog Systems (TODS) but continue to face notable performance challenges, especially in zero-shot scenarios. While prior work has noted this performance gap, the behavioral factors driving it remain under-explored. This study proposes a comprehensive evaluation framework to quantify the behavior gap between AI agents and human experts, focusing on discrepancies in dialog acts, tool usage, and knowledge utilization. Our findings reveal that this behavior gap is a critical factor negatively impacting the performance of LLM agents. Notably, as task complexity increases, the behavior gap widens (correlation: 0.963), leading to a degradation of agent performance on complex task-oriented dialogs. For the most complex task in our study, even the GPT-4o-based agent exhibits low alignment with human behavior, with a low F1 score for dialog acts (0.464), excessive and often misaligned tool usage (F1 score of 0.139), and ineffective usage of external knowledge. Reducing such behavior gaps leads to significant performance improvement (24.3% on average). This study highlights the importance of comprehensive behavioral evaluations and improved alignment strategies to enhance the effectiveness of LLM-based TODS in handling complex tasks.

* ACL 2025; 18 pages, 8 figures
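
The abstract reports F1 scores for how closely an agent's dialog acts and tool calls match those of human experts. As a rough illustration only, the sketch below computes an F1 score between two lists of labels treated as multisets. The function name `multiset_f1`, the label strings, and the multiset matching rule are assumptions made for this example; the abstract does not specify how the paper computes its scores.

```python
from collections import Counter


def multiset_f1(predicted, reference):
    """F1 between two label lists, compared as multisets (assumed matching rule)."""
    pred, ref = Counter(predicted), Counter(reference)
    # Counter intersection keeps the minimum count of each shared label.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical dialog-act labels for one conversation.
agent_acts = ["inform", "request", "request", "offer"]
human_acts = ["inform", "request", "confirm"]
score = multiset_f1(agent_acts, human_acts)
print(round(score, 3))
```

Here two of the agent's four acts match the human's three, so precision is 1/2, recall is 2/3, and the F1 score is 4/7.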

The Missing Invariance Principle Found -- the Reciprocal Twin of Invariant Risk Minimization

May 29, 2022

Dongsung Huh, Avinash Baidya

Abstract: Machine learning models often generalize poorly to out-of-distribution (OOD) data because they rely on features that are spuriously correlated with the label during training. Recently, Invariant Risk Minimization (IRM) was proposed to learn predictors that use only invariant features by conserving the feature-conditioned class expectation $\mathbb{E}_e[y|f(x)]$ across environments. However, more recent studies have demonstrated that IRM can fail in various task settings. Here, we identify a fundamental flaw in the IRM formulation that causes the failure. We then introduce a complementary notion of invariance, MRI, which is based on conserving the class-conditioned feature expectation $\mathbb{E}_e[f(x)|y]$ across environments and corrects for the flaw in IRM. Further, we introduce a simplified, practical version of the MRI formulation called MRI-v1. We note that the MRI-v1 constraint is convex, which gives it an advantage over the practical version of IRM, IRM-v1, which imposes non-convex constraints. We prove that in a general linear problem setting, MRI-v1 can guarantee invariant predictors given sufficient environments. We also empirically demonstrate that MRI strongly outperforms IRM and consistently achieves near-optimal OOD generalization in image-based nonlinear problems.
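
To make the quantity $\mathbb{E}_e[f(x)|y]$ concrete, the sketch below estimates the class-conditioned feature mean in each environment and measures how much those means disagree across environments. This is an illustrative diagnostic only, not the MRI-v1 objective from the paper; the helper names `class_conditional_means` and `invariance_gap` and the toy data are assumptions made for this example.

```python
def class_conditional_means(features, labels):
    """Empirical estimate of E[f(x) | y]: mean feature vector per label."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def invariance_gap(envs):
    """Squared deviation of each environment's class-conditional mean
    from the across-environment average, summed over shared labels."""
    per_env = [class_conditional_means(f, y) for f, y in envs]
    shared = set.intersection(*(set(m) for m in per_env))
    gap = 0.0
    for y in shared:
        vecs = [m[y] for m in per_env]
        dim = len(vecs[0])
        avg = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
        gap += sum((v[d] - avg[d]) ** 2 for v in vecs for d in range(dim))
    return gap


# Two toy environments: the first feature behaves identically in both,
# while the second feature shifts for label 0.
env_a = ([[1.0, 0.0], [1.0, 2.0]], [0, 1])
env_b = ([[1.0, 4.0], [1.0, 2.0]], [0, 1])
full_gap = invariance_gap([env_a, env_b])

# Keeping only the first feature removes the disagreement.
first_only = [([[f[0]] for f in feats], labels) for feats, labels in (env_a, env_b)]
reduced_gap = invariance_gap(first_only)
print(full_gap, reduced_gap)
```

A gap of zero means the class-conditioned feature means agree in every environment, which is the condition the abstract describes MRI as conserving.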

Combining Different V1 Brain Model Variants to Improve Robustness to Image Corruptions in CNNs

Oct 20, 2021

Avinash Baidya, Joel Dapello, James J. DiCarlo, Tiago Marques

Abstract: While some convolutional neural networks (CNNs) have surpassed human visual abilities in object classification, they often struggle to recognize objects in images corrupted with different types of common noise patterns, highlighting a major limitation of this family of models. Recently, it has been shown that simulating a primary visual cortex (V1) at the front of CNNs leads to small improvements in robustness to these image perturbations. In this study, we start with the observation that different variants of the V1 model show gains for specific corruption types. We then build a new model using an ensembling technique, which combines multiple individual models with different V1 front-end variants. The model ensemble leverages the strengths of each individual model, leading to significant improvements in robustness across all corruption categories and outperforming the base model by 38% on average. Finally, we show that using distillation, it is possible to partially compress the knowledge in the ensemble model into a single model with a V1 front-end. While the ensembling and distillation techniques used here are hardly biologically plausible, the results demonstrate that by combining the specific strengths of different neuronal circuits in V1 it is possible to improve the robustness of CNNs for a wide range of perturbations.

* 15 pages with supplementary material, 3 main figures, 2 supplementary figures, 4 supplementary tables
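
The abstract describes an ensemble that combines several models with different V1 front-end variants, without stating the combination rule. As a minimal sketch, assuming the simplest common choice of averaging each model's softmax probabilities, the example below shows how one model's confident prediction can outweigh another's weak one. The function names and the logit values are hypothetical and do not come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_model_logits):
    """Average class probabilities across models (assumed combination rule)
    and return the winning class index with the averaged distribution."""
    probs = [softmax(logits) for logits in per_model_logits]
    n_classes = len(probs[0])
    avg = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg


# Hypothetical logits from three model variants for one image, three classes.
logits = [
    [2.0, 1.0, 0.0],  # variant A weakly prefers class 0
    [0.0, 3.0, 0.0],  # variant B strongly prefers class 1
    [1.0, 1.5, 0.0],  # variant C mildly prefers class 1
]
label, avg_probs = ensemble_predict(logits)
print(label, [round(p, 3) for p in avg_probs])
```

In this toy case the ensemble picks class 1 even though variant A alone would pick class 0. Distillation, as mentioned in the abstract, would then train a single model to match the ensemble's outputs.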
