Abstract: Conversational AI is starting to support real clinical work, but most evaluation methods miss how compliance depends on the full course of a conversation. We introduce Obligatory-Information Phase Structured Compliance Evaluation (OIP-SCE), an evaluation method that checks whether every required clinical obligation is met, in the right order, with clear evidence for clinicians to review. This makes complex rules practical and auditable, helping close the gap between technical progress and what healthcare actually needs. We demonstrate the method in two case studies (respiratory history, benefits verification) and show how phase-level evidence turns policy into shared, actionable steps. By giving clinicians control over what to check and engineers a clear specification to implement, OIP-SCE provides a single, auditable evaluation surface that aligns AI capability with clinical workflow and supports routine, safe use.
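To make the idea of phase-structured, evidence-backed obligation checking concrete, the sketch below shows one minimal way such an evaluator could be organized. It is an illustration only, not the OIP-SCE implementation: the phase names, obligations, keyword-matching evidence rule, and the evaluate_transcript function are all hypothetical, and a real system would likely use an NLI or LLM judge rather than keyword hits.

```python
from dataclasses import dataclass

@dataclass
class Obligation:
    """One required item of information within a phase (names are illustrative)."""
    name: str
    keywords: list[str]  # naive evidence trigger; a real judge would be model-based

@dataclass
class Phase:
    name: str
    obligations: list[Obligation]

def evaluate_transcript(phases: list[Phase], turns: list[str]) -> dict:
    """Check that every obligation is met, in phase order, and record evidence.

    Returns, per obligation, whether it was met and the (turn index, text) evidence.
    """
    report, cursor = {}, 0
    for phase in phases:
        for ob in phase.obligations:
            hit = next(
                ((i, t) for i, t in enumerate(turns[cursor:], start=cursor)
                 if any(k in t.lower() for k in ob.keywords)),
                None,
            )
            if hit is None:
                report[f"{phase.name}/{ob.name}"] = {"met": False, "evidence": None}
            else:
                idx, text = hit
                report[f"{phase.name}/{ob.name}"] = {"met": True, "evidence": (idx, text)}
                cursor = idx  # later obligations must be evidenced at or after this turn
    return report

# Example: a tiny, made-up respiratory-history checklist and transcript.
phases = [
    Phase("intake", [Obligation("confirm_identity", ["date of birth"]),
                     Obligation("chief_complaint", ["cough", "shortness of breath"])]),
    Phase("history", [Obligation("smoking_status", ["smoke", "smoking"])]),
]
turns = ["Can you confirm your date of birth?",
         "I've had a cough for two weeks.",
         "Do you smoke?"]
print(evaluate_transcript(phases, turns))
```

The per-obligation evidence tuples are what an auditable "evaluation surface" could expose to clinicians, while the phase/obligation lists are the specification engineers implement against.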

Abstract: This paper introduces a comprehensive system for detecting hallucinations in large language model (LLM) outputs in enterprise settings. We present a novel taxonomy of LLM responses specific to hallucination in enterprise applications, categorizing them into context-based, common-knowledge, enterprise-specific, and innocuous statements. Our hallucination detection model, HDM-2, validates LLM responses with respect to both context and generally known facts (common knowledge). It provides both hallucination scores and word-level annotations, enabling precise identification of problematic content. To evaluate it on context-based and common-knowledge hallucinations, we introduce a new dataset, HDMBench. Experimental results demonstrate that HDM-2 outperforms existing approaches across the RagTruth, TruthfulQA, and HDMBench datasets. This work addresses the specific challenges of enterprise deployment, including computational efficiency, domain specialization, and fine-grained error identification. Our evaluation dataset, model weights, and inference code are publicly available.
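The sketch below illustrates the kind of output the abstract describes: a response-level hallucination score plus word-level annotations labeled with the four-way taxonomy. It is a minimal assumption-laden mock-up, not the released HDM-2 interface; the class names, fields, and the detect entry point referenced in the comments are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class StatementType(Enum):
    """The four response categories named in the abstract."""
    CONTEXT_BASED = "context-based"
    COMMON_KNOWLEDGE = "common knowledge"
    ENTERPRISE_SPECIFIC = "enterprise-specific"
    INNOCUOUS = "innocuous"

@dataclass
class SpanAnnotation:
    start: int                  # character offset into the response
    end: int
    statement_type: StatementType
    hallucinated: bool

@dataclass
class HallucinationReport:
    score: float                # response-level hallucination score in [0, 1]
    spans: list[SpanAnnotation] # word/span-level annotations

def flag_unsupported_spans(report: HallucinationReport, response: str) -> list[str]:
    """Return the response substrings the detector marked as hallucinated."""
    return [response[s.start:s.end] for s in report.spans if s.hallucinated]

# Hypothetical usage: 'detect' stands in for whatever inference entry point ships
# with the released weights and code; the real signature may differ.
# report = detect(context=retrieved_docs, response=llm_answer)
# print(report.score, flag_unsupported_spans(report, llm_answer))
```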