Yao Dou

Evaluating LLMs on Chinese Idiom Translation

Aug 14, 2025

Cai Yang, Yao Dou, David Heineman, Xiaofeng Wu, Wei Xu

Abstract:Idioms, whose figurative meanings usually differ from their literal interpretations, are common in everyday language, especially in Chinese, where they often contain historical references and follow specific structural patterns. Despite recent progress in machine translation with large language models, little is known about Chinese idiom translation. In this work, we introduce IdiomEval, a framework with a comprehensive error taxonomy for Chinese idiom translation. We annotate 900 translation pairs from nine modern systems, including GPT-4o and Google Translate, across four domains: web, news, Wikipedia, and social media. We find these systems fail at idiom translation, producing incorrect, literal, partial, or even missing translations. The best-performing system, GPT-4, makes errors in 28% of cases. We also find that existing evaluation metrics measure idiom quality poorly with Pearson correlation below 0.48 with human ratings. We thus develop improved models that achieve F$_1$ scores of 0.68 for detecting idiom translation errors.

* Accepted at COLM 2025
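The abstract reports Pearson correlation between automatic metrics and human ratings. As a minimal sketch of how such a correlation could be computed, the snippet below implements the standard Pearson formula. The function name `pearson` and the score lists are hypothetical and purely illustrative; they are not data or code from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only (assumed, not taken from the paper):
# one automatic metric score and one human rating per translation.
metric_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
human_ratings = [5, 3, 2, 1, 4]
r = pearson(metric_scores, human_ratings)
print(f"Pearson r = {r:.3f}")
```

A value near 1 would mean the metric ranks and scales translations much as human raters do; the abstract reports values below 0.48 for existing metrics.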

Measuring, Modeling, and Helping People Account for Privacy Risks in Online Self-Disclosures with AI

Dec 19, 2024

Isadora Krsek, Anubha Kabra, Yao Dou, Tarek Naous, Laura A. Dabbish, Alan Ritter, Wei Xu, Sauvik Das

Abstract:In pseudonymous online fora like Reddit, the benefits of self-disclosure are often apparent to users (e.g., I can vent about my in-laws to understanding strangers), but the privacy risks are more abstract (e.g., will my partner be able to tell that this is me?). Prior work has sought to develop natural language processing (NLP) tools that help users identify potentially risky self-disclosures in their text, but none have been designed for or evaluated with the users they hope to protect. Absent this assessment, these tools will be limited by the social-technical gap: users need assistive tools that help them make informed decisions, not paternalistic tools that tell them to avoid self-disclosure altogether. To bridge this gap, we conducted a study with N = 21 Reddit users; we had them use a state-of-the-art NLP disclosure detection model on two of their authored posts and asked them questions to understand if and how the model helped, where it fell short, and how it could be improved to help them make more informed decisions. Despite its imperfections, users responded positively to the model and highlighted its use as a tool that can help them catch mistakes, inform them of risks they were unaware of, and encourage self-reflection. However, our work also shows how, to be useful and usable, AI for supporting privacy decision-making must account for posting context, disclosure norms, and users' lived threat models, and provide explanations that help contextualize detected risks.

* 31 pages, 5 figures, accepted for publication at CSCW 2025

TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

Oct 15, 2024

Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang(+5 more)

Abstract:Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Due to the lack of fine-grained temporal annotations, existing video benchmarks mostly resemble static image benchmarks and are ill-suited to evaluating models for temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs, derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating various temporal understanding and reasoning abilities such as action frequency, motion magnitude, event order, etc. Moreover, it enables evaluations on various tasks, including video question answering and captioning and both short and long video understanding, as well as on different models such as multimodal video embedding models and text generation models. Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench, demonstrating a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we notice a critical pitfall for multi-choice QA, where LLMs can detect the subtle changes in negative captions and use a centralized description as a cue for their predictions; we propose Multiple Binary Accuracy (MBA) to correct this bias. We hope that TemporalBench can foster research on improving models' temporal reasoning capabilities. Both dataset and evaluation code will be made available.

* Project Page: https://temporalbench.github.io/
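The abstract names Multiple Binary Accuracy (MBA) but does not define it. The sketch below assumes one plausible reading: each multi-choice item is split into several binary (correct caption vs. one negative caption) questions, and the item counts as correct only if every binary question is answered correctly. The function names and the toy results are hypothetical and are not taken from the benchmark's evaluation code.

```python
from typing import List


def multiple_binary_accuracy(results: List[List[bool]]) -> float:
    """Fraction of items whose binary questions were ALL answered correctly.

    `results[i][j]` is True when the model answered the j-th binary
    question of item i correctly (assumed data layout).
    """
    if not results:
        raise ValueError("results must not be empty")
    # An item with no binary questions is counted as incorrect.
    solved = sum(1 for item in results if item and all(item))
    return solved / len(results)


def binary_accuracy(results: List[List[bool]]) -> float:
    """Plain per-question accuracy, shown for comparison."""
    flat = [answer for item in results for answer in item]
    if not flat:
        raise ValueError("results must contain at least one answer")
    return sum(flat) / len(flat)


# Toy outcomes (illustrative only): three items with 3, 3, and 2 binary questions.
toy_results = [
    [True, True, True],
    [True, False, True],
    [True, True],
]
print(multiple_binary_accuracy(toy_results))
print(binary_accuracy(toy_results))
```

Under this reading, a single wrong binary answer fails the whole item, so the score is stricter than per-question accuracy and harder to inflate by exploiting surface cues in the answer options.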

Improving Minimum Bayes Risk Decoding with Multi-Prompt

Jul 22, 2024

David Heineman, Yao Dou, Wei Xu

Abstract:While instruction fine-tuned LLMs are effective text generators, sensitivity to prompt construction makes performance unstable and sub-optimal in practice. Relying on a single "best" prompt cannot capture all differing approaches to a generation problem. Using this observation, we propose multi-prompt decoding, where many candidate generations are decoded from a prompt bank at inference-time. To ensemble candidates, we use Minimum Bayes Risk (MBR) decoding, which selects a final output using a trained value metric. We show multi-prompt improves MBR across a comprehensive set of conditional generation tasks, and show this is a result of estimating a more diverse and higher quality candidate space than that of a single prompt. Further experiments confirm multi-prompt improves generation across tasks, models and metrics.
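The selection step of MBR decoding can be sketched as follows: given a pool of candidates (here imagined as coming from different prompts), pick the one with the highest average utility against all the others. This is a minimal illustration, not the authors' implementation. The token-overlap F1 utility is an assumed stand-in for the trained metric the abstract mentions, and the candidate strings are made up.

```python
from collections import Counter
from typing import Callable, List


def overlap_f1(hyp: str, ref: str) -> float:
    """Toy utility: whitespace-token overlap F1 (stand-in for a trained metric)."""
    hyp_tokens, ref_tokens = hyp.split(), ref.split()
    if not hyp_tokens or not ref_tokens:
        return 0.0
    common = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(hyp_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mbr_select(
    candidates: List[str],
    utility: Callable[[str, str], float] = overlap_f1,
) -> str:
    """Return the candidate with the highest mean utility against the others."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i: int) -> float:
        # Treat every other candidate as a pseudo-reference.
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best_index = max(range(len(candidates)), key=expected_utility)
    return candidates[best_index]


# Hypothetical pool: one generation per prompt in an assumed prompt bank.
pool = [
    "the cat sat on the mat",
    "a cat sat on the mat",
    "the cat is sitting on a mat",
    "dogs bark loudly",
]
selected = mbr_select(pool)
print(selected)
```

The outlier candidate scores zero against the rest and is never chosen, which illustrates why a more diverse but still high-quality candidate pool can help: the consensus is estimated from more varied pseudo-references.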

GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation

May 21, 2024

Govind Ramesh, Yao Dou, Wei Xu

Abstract:Research on jailbreaking has been valuable for testing and understanding the safety and security issues of large language models (LLMs). In this paper, we introduce Iterative Refinement Induced Self-Jailbreak (IRIS), a novel approach that leverages the reflective capabilities of LLMs for jailbreaking with only black-box access. Unlike previous methods, IRIS simplifies the jailbreaking process by using a single model as both the attacker and target. This method first iteratively refines adversarial prompts through self-explanation, which is crucial for ensuring that even well-aligned LLMs obey adversarial instructions. IRIS then rates and enhances the output given the refined prompt to increase its harmfulness. We find IRIS achieves jailbreak success rates of 98% on GPT-4 and 92% on GPT-4 Turbo in under 7 queries. It significantly outperforms prior approaches in automatic, black-box and interpretable jailbreaking, while requiring substantially fewer queries, thereby establishing a new standard for interpretable jailbreaking methods.

Reducing Privacy Risks in Online Self-Disclosures with Language Models

Nov 16, 2023

Yao Dou, Isadora Krsek, Tarek Naous, Anubha Kabra, Sauvik Das, Alan Ritter, Wei Xu

Abstract:Self-disclosure, while being common and rewarding in social media interaction, also poses privacy risks. In this paper, we take the initiative to protect the user-side privacy associated with online self-disclosure through identification and abstraction. We develop a taxonomy of 19 self-disclosure categories, and curate a large corpus consisting of 4.8K annotated disclosure spans. We then fine-tune a language model for identification, achieving over 75% in Token F$_1$. We further conduct an HCI user study, with 82% of participants viewing the model positively, highlighting its real-world applicability. Motivated by the user feedback, we introduce the task of self-disclosure abstraction. We experiment with both one-span abstraction and three-span abstraction settings, and explore multiple fine-tuning strategies. Our best model can generate diverse abstractions that moderately reduce privacy risks while maintaining high utility according to human evaluation.

* LLMs, Privacy, HCI

Automatic and Human-AI Interactive Text Generation

Oct 05, 2023

Yao Dou, Philippe Laban, Claire Gardent, Wei Xu

Abstract:In this tutorial, we focus on text-to-text generation, a class of natural language generation (NLG) tasks that takes a piece of text as input and then generates a revision that is improved according to some specific criteria (e.g., readability or linguistic styles), while largely retaining the original meaning and the length of the text. This includes many useful applications, such as text simplification, paraphrase generation, style transfer, etc. In contrast to text summarization and open-ended text completion (e.g., story), the text-to-text generation tasks we discuss in this tutorial are more constrained in terms of semantic consistency and targeted language styles. This level of control makes these tasks ideal testbeds for studying the ability of models to generate text that is both semantically adequate and stylistically appropriate. Moreover, these tasks are interesting from a technical standpoint, as they require complex combinations of lexical and syntactical transformations, stylistic control, and adherence to factual knowledge -- all at once. With a special focus on text simplification and revision, this tutorial aims to provide an overview of the state-of-the-art natural language generation research from four major aspects -- Data, Models, Human-AI Collaboration, and Evaluation -- and to discuss and showcase a few significant and recent advances: (1) the use of non-retrogressive approaches; (2) the shift from fine-tuning to prompting with large language models; (3) the development of new learnable metrics and fine-grained human evaluation frameworks; (4) a growing body of studies and datasets on non-English languages; (5) the rise of HCI+NLP+Accessibility interdisciplinary research to create real-world writing assistant systems.

* To appear at ACL 2024, Tutorial

Thresh: A Unified, Customizable and Deployable Platform for Fine-Grained Text Evaluation

Aug 15, 2023

David Heineman, Yao Dou, Wei Xu

Abstract:Fine-grained, span-level human evaluation has emerged as a reliable and robust method for evaluating text generation tasks such as summarization, simplification, machine translation and news generation, and the derived annotations have been useful for training automatic metrics and improving language models. However, existing annotation tools implemented for these evaluation frameworks lack the adaptability to be extended to different domains or languages, or modify annotation settings according to user needs. And the absence of a unified annotated data format inhibits the research in multi-task learning. In this paper, we introduce Thresh, a unified, customizable and deployable platform for fine-grained evaluation. By simply creating a YAML configuration file, users can build and test an annotation interface for any framework within minutes -- all in one web browser window. To facilitate collaboration and sharing, Thresh provides a community hub that hosts a collection of fine-grained frameworks and corresponding annotations made and collected by the community, covering a wide range of NLP tasks. For deployment, Thresh offers multiple options for any scale of annotation projects from small manual inspections to large crowdsourcing ones. Additionally, we introduce a Python library to streamline the entire process from typology design and deployment to annotation processing. Thresh is publicly accessible at https://thresh.tools.

Dancing Between Success and Failure: Edit-level Simplification Evaluation using SALSA

May 23, 2023

David Heineman, Yao Dou, Mounica Maddela, Wei Xu

Abstract:Large language models (e.g., GPT-3.5) are uniquely capable of producing highly rated text simplification, yet current human evaluation methods fail to provide a clear understanding of systems' specific strengths and weaknesses. To address this limitation, we introduce SALSA, an edit-based human annotation framework that enables holistic and fine-grained text simplification evaluation. We develop twenty-one linguistically grounded edit types, covering the full spectrum of success and failure across dimensions of conceptual, syntactic and lexical simplicity. Using SALSA, we collect 12K edit annotations on 700 simplifications, revealing discrepancies in the distribution of transformation approaches performed by fine-tuned models, few-shot LLMs and humans, and finding that GPT-3.5 performs more quality edits than humans, but still exhibits frequent errors. Using our fine-grained annotations, we develop LENS-SALSA, a reference-free automatic simplification metric, trained to predict sentence- and word-level quality simultaneously. Additionally, we introduce word-level quality estimation for simplification and report promising baseline results. Our training material, annotation toolkit, and data are released at http://salsa-eval.com.

LENS: A Learnable Evaluation Metric for Text Simplification

Dec 19, 2022

Mounica Maddela, Yao Dou, David Heineman, Wei Xu

Abstract:Training learnable metrics using modern language models has recently emerged as a promising method for the automatic evaluation of machine translation. However, existing human evaluation datasets in text simplification are limited by a lack of annotations, unitary simplification types, and outdated models, making them unsuitable for this approach. To address these issues, we introduce the SIMPEVAL corpus that contains: SIMPEVAL_ASSET, comprising 12K human ratings on 2.4K simplifications of 24 systems, and SIMPEVAL_2022, a challenging simplification benchmark consisting of over 1K human ratings of 360 simplifications including generations from GPT-3.5. Training on SIMPEVAL_ASSET, we present LENS, a Learnable Evaluation Metric for Text Simplification. Extensive empirical results show that LENS correlates better with human judgment than existing metrics, paving the way for future progress in the evaluation of text simplification. To create the SIMPEVAL datasets, we introduce RANK & RATE, a human evaluation framework that rates simplifications from several models in a list-wise manner by leveraging an interactive interface, which ensures both consistency and accuracy in the evaluation process. Our metric, dataset, and annotation toolkit are available at https://github.com/Yao-Dou/LENS.
