Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Minghan Wang

MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning

Jun 06, 2025

Ye Bai, Minghan Wang, Thuy-Trang Vu

Abstract:Table-based question answering requires complex reasoning capabilities that current LLMs struggle to achieve with single-pass inference. Existing approaches, such as Chain-of-Thought reasoning and question decomposition, lack error detection mechanisms and discard problem-solving experiences, contrasting sharply with how humans tackle such problems. In this paper, we propose MAPLE (Multi-agent Adaptive Planning with Long-term mEmory), a novel framework that mimics human problem-solving through specialized cognitive agents working in a feedback-driven loop. MAPLE integrates 4 key components: (1) a Solver using the ReAct paradigm for reasoning, (2) a Checker for answer verification, (3) a Reflector for error diagnosis and strategy correction, and (4) an Archiver managing long-term memory for experience reuse and evolution. Experiments on WiKiTQ and TabFact demonstrate significant improvements over existing methods, achieving state-of-the-art performance across multiple LLM backbones.

* 26 pages, 10 figures

Via

Access Paper or Ask Questions

Discrete Minds in a Continuous World: Do Language Models Know Time Passes?

Jun 06, 2025

Minghan Wang, Ye Bai, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari

Figure 1 for Discrete Minds in a Continuous World: Do Language Models Know Time Passes?

Figure 2 for Discrete Minds in a Continuous World: Do Language Models Know Time Passes?

Figure 3 for Discrete Minds in a Continuous World: Do Language Models Know Time Passes?

Figure 4 for Discrete Minds in a Continuous World: Do Language Models Know Time Passes?

Abstract:While Large Language Models (LLMs) excel at temporal reasoning tasks like event ordering and duration estimation, their ability to perceive the actual passage of time remains unexplored. We investigate whether LLMs perceive the passage of time and adapt their decision-making accordingly through three complementary experiments. First, we introduce the Token-Time Hypothesis, positing that LLMs can map discrete token counts to continuous wall-clock time, and validate this through a dialogue duration judgment task. Second, we demonstrate that LLMs could use this awareness to adapt their response length while maintaining accuracy when users express urgency in question answering tasks. Finally, we develop BombRush, an interactive navigation challenge that examines how LLMs modify behavior under progressive time pressure in dynamic environments. Our findings indicate that LLMs possess certain awareness of time passage, enabling them to bridge discrete linguistic tokens and continuous physical time, though this capability varies with model size and reasoning abilities. This work establishes a theoretical foundation for enhancing temporal awareness in LLMs for time-sensitive applications.

Via

Access Paper or Ask Questions

SpeechDialogueFactory: Generating High-Quality Speech Dialogue Data to Accelerate Your Speech-LLM Development

Mar 31, 2025

Minghan Wang, Ye Bai, Yuxia Wang, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari

Abstract:High-quality speech dialogue datasets are crucial for Speech-LLM development, yet existing acquisition methods face significant limitations. Human recordings incur high costs and privacy concerns, while synthetic approaches often lack conversational authenticity. To address these challenges, we introduce \textsc{SpeechDialogueFactory}, a production-ready framework for generating natural speech dialogues efficiently. Our solution employs a comprehensive pipeline including metadata generation, dialogue scripting, paralinguistic-enriched utterance simulation, and natural speech synthesis with voice cloning. Additionally, the system provides an interactive UI for detailed sample inspection and a high-throughput batch synthesis mode. Evaluations show that dialogues generated by our system achieve a quality comparable to human recordings while significantly reducing production costs. We release our work as an open-source toolkit, alongside example datasets available in English and Chinese, empowering researchers and developers in Speech-LLM research and development.

Via

Access Paper or Ask Questions

Qorgau: Evaluating LLM Safety in Kazakh-Russian Bilingual Contexts

Feb 19, 2025

Maiya Goloburda, Nurkhan Laiyk, Diana Turmakhan, Yuxia Wang, Mukhammed Togmanov, Jonibek Mansurov, Askhat Sametov, Nurdaulet Mukhituly, Minghan Wang, Daniil Orel(+4 more)

Abstract:Large language models (LLMs) are known to have the potential to generate harmful content, posing risks to users. While significant progress has been made in developing taxonomies for LLM risks and safety evaluation prompts, most studies have focused on monolingual contexts, primarily in English. However, language- and region-specific risks in bilingual contexts are often overlooked, and core findings can diverge from those in monolingual settings. In this paper, we introduce Qorgau, a novel dataset specifically designed for safety evaluation in Kazakh and Russian, reflecting the unique bilingual context in Kazakhstan, where both Kazakh (a low-resource language) and Russian (a high-resource language) are spoken. Experiments with both multilingual and language-specific LLMs reveal notable differences in safety performance, emphasizing the need for tailored, region-specific datasets to ensure the responsible and safe deployment of LLMs in countries like Kazakhstan. Warning: this paper contains example data that may be offensive, harmful, or biased.

Via

Access Paper or Ask Questions

Proverbs Run in Pairs: Evaluating Proverb Translation Capability of Large Language Model

Jan 21, 2025

Minghan Wang, Viet-Thanh Pham, Farhad Moghimifar, Thuy-Trang Vu

Abstract:Despite achieving remarkable performance, machine translation (MT) research remains underexplored in terms of translating cultural elements in languages, such as idioms, proverbs, and colloquial expressions. This paper investigates the capability of state-of-the-art neural machine translation (NMT) and large language models (LLMs) in translating proverbs, which are deeply rooted in cultural contexts. We construct a translation dataset of standalone proverbs and proverbs in conversation for four language pairs. Our experiments show that the studied models can achieve good translation between languages with similar cultural backgrounds, and LLMs generally outperform NMT models in proverb translation. Furthermore, we find that current automatic evaluation metrics such as BLEU, CHRF++ and COMET are inadequate for reliably assessing the quality of proverb translation, highlighting the need for more culturally aware evaluation metrics.

Via

Access Paper or Ask Questions

Loki: An Open-Source Tool for Fact Verification

Oct 02, 2024

Haonan Li, Xudong Han, Hao Wang, Yuxia Wang, Minghan Wang, Rui Xing, Yilin Geng, Zenan Zhai, Preslav Nakov, Timothy Baldwin

Figure 1 for Loki: An Open-Source Tool for Fact Verification

Figure 2 for Loki: An Open-Source Tool for Fact Verification

Figure 3 for Loki: An Open-Source Tool for Fact Verification

Figure 4 for Loki: An Open-Source Tool for Fact Verification

Abstract:We introduce Loki, an open-source tool designed to address the growing problem of misinformation. Loki adopts a human-centered approach, striking a balance between the quality of fact-checking and the cost of human involvement. It decomposes the fact-checking task into a five-step pipeline: breaking down long texts into individual claims, assessing their check-worthiness, generating queries, retrieving evidence, and verifying the claims. Instead of fully automating the claim verification process, Loki provides essential information at each step to assist human judgment, especially for general users such as journalists and content moderators. Moreover, it has been optimized for latency, robustness, and cost efficiency at a commercially usable level. Loki is released under an MIT license and is available on GitHub. We also provide a video presenting the system and its capabilities.

Via

Access Paper or Ask Questions

Can Machines Resonate with Humans? Evaluating the Emotional and Empathic Comprehension of LMs

Jun 17, 2024

Muhammad Arslan Manzoor, Yuxia Wang, Minghan Wang, Preslav Nakov

Figure 1 for Can Machines Resonate with Humans? Evaluating the Emotional and Empathic Comprehension of LMs

Figure 2 for Can Machines Resonate with Humans? Evaluating the Emotional and Empathic Comprehension of LMs

Figure 3 for Can Machines Resonate with Humans? Evaluating the Emotional and Empathic Comprehension of LMs

Figure 4 for Can Machines Resonate with Humans? Evaluating the Emotional and Empathic Comprehension of LMs

Abstract:Empathy plays a pivotal role in fostering prosocial behavior, often triggered by the sharing of personal experiences through narratives. However, modeling empathy using NLP approaches remains challenging due to its deep interconnection with human interaction dynamics. Previous approaches, which involve fine-tuning language models (LMs) on human-annotated empathic datasets, have had limited success. In our pursuit of improving empathy understanding in LMs, we propose several strategies, including contrastive learning with masked LMs and supervised fine-tuning with Large Language Models (LLMs). While these methods show improvements over previous methods, the overall results remain unsatisfactory. To better understand this trend, we performed an analysis which reveals a low agreement among annotators. This lack of consensus hinders training and highlights the subjective nature of the task. We also explore the cultural impact on annotations. To study this, we meticulously collected story pairs in Urdu language and find that subjectivity in interpreting empathy among annotators appears to be independent of cultural background. The insights from our systematic exploration of LMs' understanding of empathy suggest that there is considerable room for exploration in both task formulation and modeling.

* 18 pages

Via

Access Paper or Ask Questions

Exploring the Potential of Multimodal LLM with Knowledge-Intensive Multimodal ASR

Jun 16, 2024

Minghan Wang, Yuxia Wang, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari

Figure 1 for Exploring the Potential of Multimodal LLM with Knowledge-Intensive Multimodal ASR

Figure 2 for Exploring the Potential of Multimodal LLM with Knowledge-Intensive Multimodal ASR

Figure 3 for Exploring the Potential of Multimodal LLM with Knowledge-Intensive Multimodal ASR

Figure 4 for Exploring the Potential of Multimodal LLM with Knowledge-Intensive Multimodal ASR

Abstract:Recent advancements in multimodal large language models (MLLMs) have made significant progress in integrating information across various modalities, yet real-world applications in educational and scientific domains remain challenging. This paper introduces the Multimodal Scientific ASR (MS-ASR) task, which focuses on transcribing scientific conference videos by leveraging visual information from slides to enhance the accuracy of technical terminologies. Realized that traditional metrics like WER fall short in assessing performance accurately, prompting the proposal of severity-aware WER (SWER) that considers the content type and severity of ASR errors. We propose the Scientific Vision Augmented ASR (SciVASR) framework as a baseline method, enabling MLLMs to improve transcript quality through post-editing. Evaluations of state-of-the-art MLLMs, including GPT-4o, show a 45% improvement over speech-only baselines, highlighting the importance of multimodal information integration.

Via

Access Paper or Ask Questions

OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs

May 09, 2024

Yuxia Wang, Minghan Wang, Hasan Iqbal, Georgi Georgiev, Jiahui Geng, Preslav Nakov

Abstract:The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. Difficulties lie in assessing the factuality of free-form responses in open domains. Also, different papers use disparate evaluation benchmarks and measurements, which renders them hard to compare and hampers future progress. To mitigate these issues, we propose OpenFactCheck, a unified factuality evaluation framework for LLMs. OpenFactCheck consists of three modules: (i) CUSTCHECKER allows users to easily customize an automatic fact-checker and verify the factual correctness of documents and claims, (ii) LLMEVAL, a unified evaluation framework assesses LLM's factuality ability from various perspectives fairly, and (iii) CHECKEREVAL is an extensible solution for gauging the reliability of automatic fact-checkers' verification results using human-annotated datasets. OpenFactCheck is publicly released at https://github.com/yuxiaw/OpenFactCheck.

* 19 pages, 8 tables, 8 figures

Via

Access Paper or Ask Questions

Against The Achilles' Heel: A Survey on Red Teaming for Generative Models

Mar 31, 2024

Lizhi Lin, Honglin Mu, Zenan Zhai, Minghan Wang, Yuxia Wang, Renxi Wang, Junjie Gao, Yixuan Zhang, Wanxiang Che, Timothy Baldwin(+2 more)

Figure 1 for Against The Achilles' Heel: A Survey on Red Teaming for Generative Models

Figure 2 for Against The Achilles' Heel: A Survey on Red Teaming for Generative Models

Figure 3 for Against The Achilles' Heel: A Survey on Red Teaming for Generative Models

Figure 4 for Against The Achilles' Heel: A Survey on Red Teaming for Generative Models

Abstract:Generative models are rapidly gaining popularity and being integrated into everyday applications, raising concerns over their safety issues as various vulnerabilities are exposed. Faced with the problem, the field of red teaming is experiencing fast-paced growth, which highlights the need for a comprehensive organization covering the entire pipeline and addressing emerging topics for the community. Our extensive survey, which examines over 120 papers, introduces a taxonomy of fine-grained attack strategies grounded in the inherent capabilities of language models. Additionally, we have developed the searcher framework that unifies various automatic red teaming approaches. Moreover, our survey covers novel areas including multimodal attacks and defenses, risks around multilingual models, overkill of harmless queries, and safety of downstream applications. We hope this survey can provide a systematic perspective on the field and unlock new areas of research.

Via

Access Paper or Ask Questions