Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chieh-Yen Lin

Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models?

May 23, 2025

Zhi Rui Tam, Cheng-Kuang Wu, Yu Ying Chiu, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee

Abstract:Large reasoning models (LRMs) have demonstrated impressive performance across a range of reasoning tasks, yet little is known about their internal reasoning processes in multilingual settings. We begin with a critical question: {\it In which language do these models reason when solving problems presented in different languages?} Our findings reveal that, despite multilingual training, LRMs tend to default to reasoning in high-resource languages (e.g., English) at test time, regardless of the input language. When constrained to reason in the same language as the input, model performance declines, especially for low-resource languages. In contrast, reasoning in high-resource languages generally preserves performance. We conduct extensive evaluations across reasoning-intensive tasks (MMMLU, MATH-500) and non-reasoning benchmarks (CulturalBench, LMSYS-toxic), showing that the effect of language choice varies by task type: input-language reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior. By exposing these linguistic biases in LRMs, our work highlights a critical step toward developing more equitable models that serve users across diverse linguistic backgrounds.

Via

Access Paper or Ask Questions

Answer, Refuse, or Guess? Investigating Risk-Aware Decision Making in Language Models

Mar 03, 2025

Cheng-Kuang Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee

Abstract:Knowing when to answer or refuse is crucial for safe and reliable decision-making language agents. Although prior work has introduced refusal strategies to boost LMs' reliability, how these models adapt their decisions to different risk levels remains underexplored. We formalize the task of risk-aware decision-making, expose critical weaknesses in existing LMs, and propose skill-decomposition solutions to mitigate them. Our findings show that even cutting-edge LMs--both regular and reasoning models--still require explicit prompt chaining to handle the task effectively, revealing the challenges that must be overcome to achieve truly autonomous decision-making agents.

* preprint

Via

Access Paper or Ask Questions

None of the Above, Less of the Right: Parallel Patterns between Humans and LLMs on Multi-Choice Questions Answering

Mar 03, 2025

Zhi Rui Tam, Cheng-Kuang Wu, Chieh-Yen Lin, Yun-Nung Chen

Abstract:Multiple-choice exam questions with "None of the above" (NA) options have been extensively studied in educational testing, in which existing research suggests that they better assess true knowledge. However, their impact on Large Language Models (LLMs) evaluation remains underexplored. Through systematic experiments with 28 LLMs on the MMLU benchmark, we examine how NA options affect model performance and confidence calibration. Our analysis reveals that NA options, when used as the correct answer, lead to a consistent 30-50\% performance drop across models regardless of scale--suggesting that LLMs lack the meta-cognitive ability to systematically evaluate and reject all given options when none are correct. This degradation shows strong domain dependence, with minimal impact on mathematical reasoning (14.6\% drop) but severe effects on tasks requiring uncertainty handling like business ethics (48.1\% drop). Our results highlight important implications for benchmark design and raise questions about LLMs' ability to handle uncertainty in real-world applications.

Via

Access Paper or Ask Questions

Clear Minds Think Alike: What Makes LLM Fine-tuning Robust? A Study of Token Perplexity

Jan 24, 2025

Chao-Chung Wu, Zhi Rui Tam, Chieh-Yen Lin, Hung-yi Lee, Yun-Nung Chen

Figure 1 for Clear Minds Think Alike: What Makes LLM Fine-tuning Robust? A Study of Token Perplexity

Figure 2 for Clear Minds Think Alike: What Makes LLM Fine-tuning Robust? A Study of Token Perplexity

Figure 3 for Clear Minds Think Alike: What Makes LLM Fine-tuning Robust? A Study of Token Perplexity

Figure 4 for Clear Minds Think Alike: What Makes LLM Fine-tuning Robust? A Study of Token Perplexity

Abstract:Maintaining consistent model performance across domains is a fundamental challenge in machine learning. While recent work has explored using LLM-generated data for fine-tuning, its impact on cross-domain generalization remains poorly understood. In this paper, we present a systematic analysis revealing that fine-tuning with LLM-generated data not only improves target task performance but also reduces out-of-domain (OOD) degradation compared to fine-tuning with ground truth data. Through analyzing the data sequence in tasks of various domains, we demonstrate that this enhanced OOD robustness stems from a reduced prevalence of high perplexity tokens in LLM-generated sequences. Following this hypothesis we showed that masking high perplexity tokens in ground truth training data also achieves similar OOD preservation comparable to using LLM-generated data. Extensive experiments across diverse model architectures and scales, including Gemma2-2B, Mistral-7B and Llama3-8B, corroborate the consistency of our findings. To the best of our knowledge, this work provides the first mechanistic explanation for the superior OOD robustness conferred by LLM-generated training data, offering valuable insights for developing more robust fine-tuning strategies.

Via

Access Paper or Ask Questions

Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models

Aug 05, 2024

Zhi Rui Tam, Cheng-Kuang Wu, Yi-Lin Tsai, Chieh-Yen Lin, Hung-yi Lee, Yun-Nung Chen

Abstract:Structured generation, the process of producing content in standardized formats like JSON and XML, is widely utilized in real-world applications to extract key output information from large language models (LLMs). This study investigates whether such constraints on generation space impact LLMs' abilities, including reasoning and domain knowledge comprehension. Specifically, we evaluate LLMs' performance when restricted to adhere to structured formats versus generating free-form responses across various common tasks. Surprisingly, we observe a significant decline in LLMs' reasoning abilities under format restrictions. Furthermore, we find that stricter format constraints generally lead to greater performance degradation in reasoning tasks.

* 18 pages

Via

Access Paper or Ask Questions

I Need Help! Evaluating LLM's Ability to Ask for Users' Support: A Case Study on Text-to-SQL Generation

Jul 20, 2024

Cheng-Kuang Wu, Zhi Rui Tam, Chao-Chung Wu, Chieh-Yen Lin, Hung-yi Lee, Yun-Nung Chen

Abstract:In this study, we explore the proactive ability of LLMs to seek user support, using text-to-SQL generation as a case study. We propose metrics to evaluate the trade-off between performance improvements and user burden, and investigate whether LLMs can determine when to request help and examine their performance with varying levels of information availability. Our experiments reveal that without external feedback, many LLMs struggle to recognize their need for additional support. Our findings highlight the importance of external signals and provide insights for future research on improving support-seeking strategies.

* 9 pages, 9 figures

Via

Access Paper or Ask Questions

StreamBench: Towards Benchmarking Continuous Improvement of Language Agents

Jun 13, 2024

Cheng-Kuang Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee

Abstract:Recent works have shown that large language model (LLM) agents are able to improve themselves from experience, which is an important ability for continuous enhancement post-deployment. However, existing benchmarks primarily evaluate their innate capabilities and do not assess their ability to improve over time. To address this gap, we introduce StreamBench, a pioneering benchmark designed to evaluate the continuous improvement of LLM agents over an input-feedback sequence. StreamBench simulates an online learning environment where LLMs receive a continuous flow of feedback stream and iteratively enhance their performance. In addition, we propose several simple yet effective baselines for improving LLMs on StreamBench, and provide a comprehensive analysis to identify critical components that contribute to successful streaming strategies. Our work serves as a stepping stone towards developing effective online learning strategies for LLMs, paving the way for more adaptive AI systems in streaming scenarios.

Via

Access Paper or Ask Questions