Soyeon Caren Han

EMMM, Explain Me My Model! Explainable Machine Generated Text Detection in Dialogues

Aug 26, 2025

Angela Yifei Yuan, Haoyi Li, Soyeon Caren Han, Christopher Leckie

Abstract:The rapid adoption of large language models (LLMs) in customer service introduces new risks, as malicious actors can exploit them to conduct large-scale user impersonation through machine-generated text (MGT). Current MGT detection methods often struggle in online conversational settings, reducing the reliability and interpretability essential for trustworthy AI deployment. In customer service scenarios where operators are typically non-expert users, explanations become crucial for trustworthy MGT detection. In this paper, we propose EMMM, an explanation-then-detection framework that balances latency, accuracy, and non-expert-oriented interpretability. Experimental results demonstrate that EMMM provides explanations accessible to non-expert users, with 70% of human evaluators preferring its outputs, while achieving competitive accuracy compared to state-of-the-art models and maintaining low latency, generating outputs within 1 second. Our code and dataset are open-sourced at https://github.com/AngieYYF/EMMM-explainable-chatbot-detection.

* 15 pages

SPADE: Systematic Prompt Framework for Automated Dialogue Expansion in Machine-Generated Text Detection

Mar 19, 2025

Haoyi Li, Angela Yifei Yuan, Soyeon Caren Han, Christopher Leckie

Abstract:The increasing capability of large language models (LLMs) to generate synthetic content has heightened concerns about their misuse, driving the development of Machine-Generated Text (MGT) detection models. However, these detectors face significant challenges due to the lack of systematically generated, high-quality datasets for training. To address this issue, we propose five novel data augmentation frameworks for synthetic user dialogue generation through a structured prompting approach, reducing the costs associated with traditional data collection methods. Our proposed method yields 14 new dialogue datasets, which we benchmark against seven MGT detection models. The results demonstrate improved generalization performance when utilizing a mixed dataset produced by our proposed augmentation framework. Furthermore, considering that real-world agents lack knowledge of future opponent utterances, we simulate online dialogue detection and examine the relationship between chat history length and detection accuracy. We also benchmark online detection performance with limited chat history on our frameworks. Our open-source datasets can be downloaded from https://github.com/AngieYYF/SPADE-customer-service-dialogue.

* 9 pages

A Training-Free Length Extrapolation Approach for LLMs: Greedy Attention Logit Interpolation (GALI)

Feb 04, 2025

Yan Li, Tianyi Zhang, Zechuan Li, Soyeon Caren Han

Abstract:Transformer-based Large Language Models (LLMs) struggle to process inputs exceeding their training context window, with performance degrading due to positional out-of-distribution (O.O.D.) issues that disrupt attention computations. Existing solutions, both fine-tuning and training-free methods, are limited by computational inefficiency, attention logit outliers or loss of local positional information. To address this, we propose Greedy Attention Logit Interpolation (GALI), a training-free length extrapolation method that maximizes the utilization of pretrained positional intervals while avoiding attention logit outliers through attention logit interpolation. The results demonstrate that GALI consistently outperforms state-of-the-art training-free methods. Our findings reveal that LLMs interpret positional intervals unevenly within their training context window, suggesting that extrapolating within a smaller positional interval range yields superior results, even for short-context tasks. GALI represents a significant step toward resolving the positional O.O.D. challenge, enabling more reliable long-text understanding in LLMs. Our implementation of GALI, along with the experiments from our paper, is open-sourced at https://github.com/AcademyCityL/GALI.

* 9 pages, under review in the conference

Multimodal Graph Constrastive Learning and Prompt for ChartQA

Jan 08, 2025

Yue Dai, Soyeon Caren Han, Wei Liu

Abstract:ChartQA presents significant challenges due to the complex distribution of chart elements and the implicit patterns embedded within the underlying data. In this chapter, we have developed a joint multimodal scene graph for charts, explicitly representing the relationships between chart elements and their associated patterns. Our proposed multimodal scene graph consists of two components: a visual graph and a textual graph, each designed to capture the structural and semantic information within the chart. To unify representations across these different modalities, we introduce a multimodal graph contrastive learning approach that learns unified representations by maximizing similarity between nodes representing the same object across multimodal graphs. The learned graph representations can be seamlessly incorporated into a transformer decoder as a soft prompt. Additionally, given the growing need for Multimodal Large Language Models (MLLMs) in zero-shot scenarios, we have designed Chain-of-Thought (CoT) prompts for MLLMs to reduce hallucinations. We tested both methods on public benchmarks such as ChartQA, OpenCQA, and ChartX, demonstrating improved performance and validating the effectiveness of our proposed methods.
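
The cross-modal contrastive objective mentioned in the abstract above can be illustrated with a minimal sketch. It assumes an InfoNCE-style loss with cosine similarity and a temperature, and assumes that node `i` in the visual graph and node `i` in the textual graph represent the same chart object. These choices, along with the function names `cosine` and `info_nce`, are illustrative assumptions and not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(visual_nodes, textual_nodes, temperature=0.1):
    """Average InfoNCE loss over visual nodes (hypothetical formulation).

    Assumption: visual_nodes[i] and textual_nodes[i] describe the same object,
    so the matching textual node is the positive and all others are negatives.
    """
    total = 0.0
    for i, v in enumerate(visual_nodes):
        logits = [cosine(v, t) / temperature for t in textual_nodes]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(visual_nodes)


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]
    aligned_text = [[1.0, 0.0], [0.0, 1.0]]
    swapped_text = [[0.0, 1.0], [1.0, 0.0]]
    print(info_nce(visual, aligned_text))   # small: matching nodes are most similar
    print(info_nce(visual, swapped_text))   # large: matching nodes are least similar
```

Minimizing this loss pulls the two embeddings of the same object together and pushes embeddings of different objects apart, which is one common way to realize "maximizing similarity between nodes representing the same object".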

Multimodal Commonsense Knowledge Distillation for Visual Question Answering

Nov 05, 2024

Shuo Yang, Siwen Luo, Soyeon Caren Han

Abstract:Existing Multimodal Large Language Models (MLLMs) and Visual Language Pretrained Models (VLPMs) have shown remarkable performance in general Visual Question Answering (VQA). However, these models struggle with VQA questions that require external commonsense knowledge due to the challenges in generating high-quality prompts and the high computational costs of fine-tuning. In this work, we propose a novel graph-based multimodal commonsense knowledge distillation framework that constructs a unified relational graph over commonsense knowledge, visual objects and questions through a Graph Convolutional Network (GCN) in a teacher-student setting. The proposed framework is flexible with any type of teacher and student models without further fine-tuning, and achieves competitive performance on the ScienceQA dataset.

* AAAI 2025 (Accepted, Oral)

TriG-NER: Triplet-Grid Framework for Discontinuous Named Entity Recognition

Nov 04, 2024

Rina Carines Cabral, Soyeon Caren Han, Areej Alhassan, Riza Batista-Navarro, Goran Nenadic, Josiah Poon

Abstract:Discontinuous Named Entity Recognition (DNER) presents a challenging problem where entities may be scattered across multiple non-adjacent tokens, making traditional sequence labelling approaches inadequate. Existing methods predominantly rely on custom tagging schemes to handle these discontinuous entities, resulting in models tightly coupled to specific tagging strategies and lacking generalisability across diverse datasets. To address these challenges, we propose TriG-NER, a novel Triplet-Grid Framework that introduces a generalisable approach to learning robust token-level representations for discontinuous entity extraction. Our framework applies triplet loss at the token level, where similarity is defined by word pairs existing within the same entity, effectively pulling together similar and pushing apart dissimilar ones. This approach enhances entity boundary detection and reduces the dependency on specific tagging schemes by focusing on word-pair relationships within a flexible grid structure. We evaluate TriG-NER on three benchmark DNER datasets and demonstrate significant improvements over existing grid-based architectures. These results underscore our framework's effectiveness in capturing complex entity structures and its adaptability to various tagging schemes, setting a new benchmark for discontinuous entity extraction.

* Code will be made available upon publication
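
As an illustration of the token-level triplet loss described in the TriG-NER abstract, the following is a minimal sketch and not the authors' code. It assumes Euclidean distance, a fixed margin, and a simple mining rule: two tokens are "similar" when they carry the same entity id, and any token with a different id (or no entity, marked `None`) serves as a negative. All function names here are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def token_triplets(entity_ids):
    """Yield (anchor, positive, negative) token indices.

    Assumption: tokens sharing a non-None entity id belong to the same entity,
    even if they are not adjacent, which is the discontinuous case.
    """
    n = len(entity_ids)
    for a in range(n):
        if entity_ids[a] is None:
            continue
        for p in range(n):
            if p == a or entity_ids[p] != entity_ids[a]:
                continue
            for q in range(n):
                if entity_ids[q] != entity_ids[a]:
                    yield a, p, q


def mean_token_triplet_loss(embeddings, entity_ids, margin=1.0):
    """Average triplet loss over all mined token triplets (0.0 if none exist)."""
    losses = [
        triplet_loss(embeddings[a], embeddings[p], embeddings[q], margin)
        for a, p, q in token_triplets(entity_ids)
    ]
    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    # Tokens 0 and 2 form one discontinuous entity; token 1 lies outside it.
    entity_ids = [0, None, 0]
    well_separated = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
    poorly_separated = [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]]
    print(mean_token_triplet_loss(well_separated, entity_ids))
    print(mean_token_triplet_loss(poorly_separated, entity_ids))
```

In the first case the two entity tokens coincide and the outside token is far away, so the loss vanishes. In the second case the entity tokens are far apart and the loss is positive, which is the signal that would pull them together during training.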

Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond

Oct 08, 2024

Soyeon Caren Han, Feiqi Cao, Josiah Poon, Roberto Navigli

Abstract:This tutorial explores recent advancements in multimodal pretrained and large models, capable of integrating and processing diverse data forms such as text, images, audio, and video. Participants will gain an understanding of the foundational concepts of multimodality, the evolution of multimodal research, and the key technical challenges addressed by these models. We will cover the latest multimodal datasets and pretrained models, including those beyond vision and language. Additionally, the tutorial will delve into the intricacies of multimodal large models and instruction tuning strategies to optimise performance for specific tasks. Hands-on laboratories will offer practical experience with state-of-the-art multimodal models, demonstrating real-world applications like visual storytelling and visual question answering. This tutorial aims to equip researchers, practitioners, and newcomers with the knowledge and skills to leverage multimodal AI. ACM Multimedia 2024 is the ideal venue for this tutorial, aligning perfectly with our goal of understanding multimodal pretrained and large language models, and their tuning mechanisms.

* Accepted at ACM-MM 2024

DAViD: Domain Adaptive Visually-Rich Document Understanding with Synthetic Insights

Oct 02, 2024

Yihao Ding, Soyeon Caren Han, Zechuan Li, Hyunsuk Chung

Abstract:Visually-Rich Documents (VRDs), encompassing elements like charts, tables, and references, convey complex information across various fields. However, extracting information from these rich documents is labor-intensive, especially given their inconsistent formats and domain-specific requirements. While pretrained models for VRD Understanding have progressed, their reliance on large, annotated datasets limits scalability. This paper introduces the Domain Adaptive Visually-rich Document Understanding (DAViD) framework, which utilises machine-generated synthetic data for domain adaptation. DAViD integrates fine-grained and coarse-grained document representation learning and employs synthetic annotations to reduce the need for costly manual labelling. By leveraging pretrained models and synthetic data, DAViD achieves competitive performance with minimal annotated datasets. Extensive experiments validate DAViD's effectiveness, demonstrating its ability to efficiently adapt to domain-specific VRDU tasks.

* Work in progress

MIDAS: Multi-level Intent, Domain, And Slot Knowledge Distillation for Multi-turn NLU

Aug 15, 2024

Yan Li, So-Eon Kim, Seong-Bae Park, Soyeon Caren Han

Abstract:Although Large Language Models (LLMs) can generate coherent and contextually relevant text, they often struggle to recognise the intent behind the human user's query. Natural Language Understanding (NLU) models, however, interpret the purpose and key information of the user's input to enable responsive interactions. Existing NLU models generally map individual utterances to a dual-level semantic frame, involving sentence-level intent and word-level slot labels. However, real-life conversations primarily consist of multi-turn conversations, involving the interpretation of complex and extended dialogues. Researchers encounter challenges addressing all facets of multi-turn dialogue conversations using a unified single NLU model. This paper introduces a novel approach, MIDAS, leveraging a multi-level intent, domain, and slot knowledge distillation for multi-turn NLU. To achieve this, we construct distinct teachers for varying levels of conversation knowledge, namely, sentence-level intent detection, word-level slot filling, and conversation-level domain classification. These teachers are then fine-tuned to acquire specific knowledge of their designated levels. A multi-teacher loss is proposed to facilitate the combination of these multi-level teachers, guiding a student model in multi-turn dialogue tasks. The experimental results demonstrate the efficacy of our model in improving the overall multi-turn conversation understanding, showcasing the potential for advancements in NLU models through the incorporation of multi-level dialogue knowledge distillation techniques.
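
One generic way to combine several level-specific teachers into a single training signal is a weighted sum of per-level distillation terms. The sketch below assumes temperature-softened softmax outputs and a KL divergence per level (intent, slot, domain); the weights, the temperature, and the function names are hypothetical, and the abstract does not specify the actual form of the MIDAS multi-teacher loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably by subtracting the max."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def multi_teacher_loss(student_logits, teacher_logits, weights, temperature=2.0):
    """Weighted sum of per-level distillation losses (assumed formulation).

    student_logits / teacher_logits: dicts mapping a level name to logits.
    weights: dict mapping the same level names to scalar weights.
    """
    loss = 0.0
    for level, t_logits in teacher_logits.items():
        teacher_probs = softmax(t_logits, temperature)
        student_probs = softmax(student_logits[level], temperature)
        loss += weights[level] * kl_divergence(teacher_probs, student_probs)
    return loss


if __name__ == "__main__":
    teachers = {"intent": [2.0, 0.5], "slot": [0.1, 1.5, 0.3], "domain": [1.0, 1.0]}
    student = {"intent": [0.5, 2.0], "slot": [0.1, 1.5, 0.3], "domain": [1.0, 1.0]}
    weights = {"intent": 1.0, "slot": 1.0, "domain": 0.5}
    # Only the intent level disagrees, so it alone contributes to the loss.
    print(multi_teacher_loss(student, teachers, weights))
```

The per-level weights let a practitioner emphasize whichever level of conversation knowledge matters most for the downstream task.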

MSG-Chart: Multimodal Scene Graph for ChartQA

Aug 09, 2024

Yue Dai, Soyeon Caren Han, Wei Liu

Abstract:Automatic Chart Question Answering (ChartQA) is challenging due to the complex distribution of chart elements and because patterns in the underlying data are not explicitly displayed in charts. To address this challenge, we design a joint multimodal scene graph for charts to explicitly represent the relationships between chart elements and their patterns. Our proposed multimodal scene graph includes a visual graph and a textual graph to jointly capture the structural and semantic knowledge from the chart. This graph module can be easily integrated with different vision transformers as inductive bias. Our experiments demonstrate that incorporating the proposed graph module enhances the understanding of the structure and semantics of chart elements, thereby improving performance on publicly available benchmarks, ChartQA and OpenCQA.

* Accepted by CIKM Short 2024
