Yvette Graham

Audio-Based Crowd-Sourced Evaluation of Machine Translation Quality

Sep 17, 2025

Sami Ul Haq, Sheila Castilho, Yvette Graham

Abstract:Machine Translation (MT) has achieved remarkable performance, with growing interest in speech translation and multimodal approaches. However, despite these advancements, MT quality assessment remains largely text-centric, typically relying on human experts who read and compare texts. Since many real-world MT applications (e.g., Google Translate Voice Mode, iFLYTEK Translator) involve translation being spoken rather than printed or read, a more natural way to assess translation quality would be through speech as opposed to text-only evaluations. This study compares text-only and audio-based evaluations of 10 MT systems from the WMT General MT Shared Task, using crowd-sourced judgments collected via Amazon Mechanical Turk. We additionally performed statistical significance testing and self-replication experiments to test the reliability and consistency of the audio-based approach. Crowd-sourced assessments based on audio yield rankings largely consistent with text-only evaluations but, in some cases, identify significant differences between translation systems. We attribute this to speech being a richer, more natural modality and propose incorporating speech-based assessments into future MT evaluation frameworks.

* Accepted at WMT 2025 (EMNLP) for oral presentation

REFINE-LM: Mitigating Language Model Stereotypes via Reinforcement Learning

Aug 18, 2024

Rameez Qureshi, Naïm Es-Sebbani, Luis Galárraga, Yvette Graham, Miguel Couceiro, Zied Bouraoui

Abstract:With the introduction of (large) language models, there has been significant concern about the unintended bias such models may inherit from their training data. A number of studies have shown that such models propagate gender stereotypes, as well as geographical and racial bias, among other biases. While existing works tackle this issue by preprocessing data and debiasing embeddings, the proposed methods require a lot of computational resources and annotation effort while being limited to certain types of biases. To address these issues, we introduce REFINE-LM, a debiasing method that uses reinforcement learning to handle different types of biases without any fine-tuning. By training a simple model on top of the word probability distribution of an LM, our bias-agnostic reinforcement learning method enables model debiasing without human annotations or significant computational resources. Experiments conducted on a wide range of models, including several LMs, show that our method (i) significantly reduces stereotypical biases while preserving LM performance; (ii) is applicable to different types of biases, generalizing across contexts such as gender, ethnicity, religion, and nationality-based biases; and (iii) is not expensive to train.

ALoRA: Allocating Low-Rank Adaptation for Fine-tuning Large Language Models

Mar 24, 2024

Zequan Liu, Jiawen Lyn, Wei Zhu, Xing Tian, Yvette Graham

Abstract:Parameter-efficient fine-tuning (PEFT) is widely studied for its effectiveness and efficiency in the era of large language models. Low-rank adaptation (LoRA) has demonstrated commendable performance as a popular and representative method. However, it is implemented with a fixed intrinsic rank that might not be the ideal setting for the downstream tasks. Recognizing the need for more flexible downstream task adaptation, we extend the methodology of LoRA to an innovative approach we call allocating low-rank adaptation (ALoRA) that enables dynamic adjustments to the intrinsic rank during the adaptation process. First, we propose a novel method, AB-LoRA, that can effectively estimate the importance score of each LoRA rank. Second, guided by AB-LoRA, we gradually prune abundant and negatively impacting LoRA ranks and allocate the pruned LoRA budgets to important Transformer modules needing higher ranks. We have conducted experiments on various tasks, and the experimental results demonstrate that our ALoRA method can outperform the recent baselines with comparable tunable parameters.

* Accepted by NAACL-2024
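
To make the notion of a "LoRA rank" in the abstract above concrete, here is a minimal sketch of a low-rank weight update in which individual ranks can be switched off. It illustrates only the general idea of pruning a rank; it does not implement AB-LoRA importance scoring or the authors' budget reallocation. The function name `lora_delta` and the 0/1 `mask` formulation are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def lora_delta(B: Matrix, A: Matrix, mask: List[int]) -> Matrix:
    """Return the low-rank update B @ diag(mask) @ A.

    B is d_out x r, A is r x d_in, and mask has one 0/1 entry per rank.
    Setting mask[k] = 0 removes the contribution of rank k, which is one
    simple way to picture "pruning" a rank (an illustrative assumption,
    not the paper's exact mechanism).
    """
    d_out, r, d_in = len(B), len(A), len(A[0])
    if len(mask) != r or any(len(row) != r for row in B):
        raise ValueError("B, A and mask disagree on the rank r")
    return [
        [sum(B[i][k] * mask[k] * A[k][j] for k in range(r)) for j in range(d_in)]
        for i in range(d_out)
    ]


B = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 0.0], [0.0, 1.0]]
full_update = lora_delta(B, A, [1, 1])    # both ranks active
pruned_update = lora_delta(B, A, [1, 0])  # second rank pruned
```

With both ranks active the update equals `B @ A`; masking the second rank leaves only the first rank's outer product, so that rank's parameter budget could in principle be spent on another module.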

Findings of the First Workshop on Simulating Conversational Intelligence in Chat

Feb 09, 2024

Yvette Graham, Mohammed Rameez Qureshi, Haider Khalid, Gerasimos Lampouras, Ignacio Iacobacci, Qun Liu

Abstract:The aim of this workshop is to bring together experts working on open-domain dialogue research. In this rapidly advancing research area many challenges still exist, such as learning information from conversations and engaging in realistic and convincing simulation of human intelligence and reasoning. SCI-CHAT follows previous workshops on open-domain dialogue but with a focus on the simulation of intelligent conversation as judged in a live human evaluation. Models aim to include the ability to follow a challenging topic over a multi-turn conversation, while positing, refuting and reasoning over arguments. The workshop included both a research track and a shared task. The main goal of this paper is to provide an overview of the shared task and a link to an additional paper that will include an in-depth analysis of the shared task results following presentation at the workshop.

Findings of the WMT 2023 Shared Task on Discourse-Level Literary Translation: A Fresh Orb in the Cosmos of LLMs

Nov 06, 2023

Longyue Wang, Zhaopeng Tu, Yan Gu, Siyou Liu, Dian Yu, Qingsong Ma, Chenyang Lyu, Liting Zhou, Chao-Hong Liu, Yufeng Ma(+7 more)

Abstract:Translating literary works has perennially stood as an elusive dream in machine translation (MT), a journey steeped in intricate challenges. To foster progress in this domain, we hold a new shared task at WMT 2023, the first edition of the Discourse-Level Literary Translation task. First, we (Tencent AI Lab and China Literature Ltd.) release a copyrighted, document-level Chinese-English web novel corpus. Furthermore, we put forth industry-endorsed criteria to guide the human evaluation process. This year, we received a total of 14 submissions from 7 academic and industry teams. We employ both automatic and human evaluations to measure the performance of the submitted systems. The official ranking of the systems is based on the overall human judgments. In addition, our extensive analysis reveals a series of interesting findings on literary and discourse-aware MT. We release data, system outputs, and leaderboard at http://www2.statmt.org/wmt23/literary-translation-task.html.

* WMT2023 Discourse-Level Literary Translation Shared Task Overview Paper

Do Stochastic Parrots have Feelings Too? Improving Neural Detection of Synthetic Text via Emotion Recognition

Oct 24, 2023

Alan Cowap, Yvette Graham, Jennifer Foster

Abstract:Recent developments in generative AI have shone a spotlight on high-performance synthetic text generation technologies. The now wide availability and ease of use of such models highlights the urgent need to provide equally powerful technologies capable of identifying synthetic text. With this in mind, we draw inspiration from psychological studies which suggest that people can be driven by emotion and encode emotion in the text they compose. We hypothesize that pretrained language models (PLMs) have an affective deficit because they lack such an emotional driver when generating text and consequently may generate synthetic text which has affective incoherence, i.e., lacking the kind of emotional coherence present in human-authored text. We subsequently develop an emotionally aware detector by fine-tuning a PLM on emotion. Experimental results indicate that our emotionally aware detector achieves improvements across a range of synthetic text generators, various sized models, datasets, and domains. Finally, we compare our emotionally aware synthetic text detector to ChatGPT in the task of identification of its own output and show substantial gains, reinforcing the potential of emotion as a signal to identify synthetic text. Code, models, and datasets are available at https://github.com/alanagiasi/emoPLMsynth

* Accepted to Findings of EMNLP 2023 (long paper). Camera ready version

An overview on the evaluated video retrieval tasks at TRECVID 2022

Jun 22, 2023

George Awad, Keith Curtis, Asad Butt, Jonathan Fiscus, Afzal Godil, Yooyoung Lee, Andrew Delgado, Eliot Godard, Lukas Diduch, Jeffrey Liu(+2 more)

Abstract:The TREC Video Retrieval Evaluation (TRECVID) is a TREC-style video analysis and retrieval evaluation with the goal of promoting progress in research and development of content-based exploitation and retrieval of information from digital video via open, tasks-based evaluation supported by metrology. Over the last twenty-one years this effort has yielded a better understanding of how systems can effectively accomplish such processing and how one can reliably benchmark their performance. TRECVID has been funded by NIST (National Institute of Standards and Technology) and other US government agencies. In addition, many organizations and individuals worldwide contribute significant time and effort. TRECVID 2022 planned for the following six tasks: Ad-hoc video search, Video to text captioning, Disaster scene description and indexing, Activity in extended videos, deep video understanding, and movie summarization. In total, 35 teams from various research organizations worldwide signed up to join the evaluation campaign this year. This paper introduces the tasks, datasets used, evaluation frameworks and metrics, as well as a high-level results overview.

* arXiv admin note: substantial text overlap with arXiv:2104.13473, arXiv:2009.09984

Is a Video worth $n\times n$ Images? A Highly Efficient Approach to Transformer-based Video Question Answering

May 16, 2023

Chenyang Lyu, Tianbo Ji, Yvette Graham, Jennifer Foster

Abstract:Conventional Transformer-based Video Question Answering (VideoQA) approaches generally encode frames independently through one or more image encoders followed by interaction between frames and question. However, such schema would incur significant memory use and inevitably slow down the training and inference speed. In this work, we present a highly efficient approach for VideoQA based on existing vision-language pre-trained models where we concatenate video frames to a $n\times n$ matrix and then convert it to one image. By doing so, we reduce the use of the image encoder from $n^{2}$ to $1$ while maintaining the temporal structure of the original video. Experimental results on MSRVTT and TrafficQA show that our proposed approach achieves state-of-the-art performance with nearly $4\times$ faster speed and only 30% memory use. We show that by integrating our approach into VideoQA systems we can achieve comparable, even superior, performance with a significant speed up for training and inference. We believe the proposed approach can facilitate VideoQA-related research by reducing the computational requirements for those who have limited access to budgets and resources. Our code will be made publicly available for research use.
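
The frame-concatenation step described in the abstract above can be pictured with a small sketch: $n^{2}$ frames are tiled into an $n\times n$ grid and treated as one image, so the image encoder runs once rather than $n^{2}$ times. The abstract does not say how frames are ordered within the grid; the row-major (left to right, top to bottom) temporal order used here, the function name `tile_frames`, and the nested-list frame representation are all assumptions made for illustration.

```python
from typing import List

Frame = List[List[int]]  # one grayscale frame as rows of pixel values


def tile_frames(frames: List[Frame], n: int) -> Frame:
    """Tile n*n equally sized frames into a single (n*H) x (n*W) image.

    Frames are placed in row-major order, so temporal order runs left to
    right and then top to bottom (an assumed layout, not confirmed by the
    abstract).
    """
    if len(frames) != n * n:
        raise ValueError("expected exactly n*n frames")
    height = len(frames[0])
    grid: Frame = []
    for grid_row in range(n):
        # The n frames that share this row of the grid.
        row_frames = frames[grid_row * n:(grid_row + 1) * n]
        for y in range(height):
            line: List[int] = []
            for frame in row_frames:
                line.extend(frame[y])  # append this frame's y-th pixel row
            grid.append(line)
    return grid


# Four 2x2 frames, each filled with its own frame index.
frames = [[[i, i], [i, i]] for i in range(4)]
grid = tile_frames(frames, 2)
```

Here `grid` is a single 4x4 image whose quadrants hold frames 0 to 3, so a downstream encoder would be called once on `grid` instead of four times on the individual frames.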

Semantic-aware Dynamic Retrospective-Prospective Reasoning for Event-level Video Question Answering

May 14, 2023

Chenyang Lyu, Tianbo Ji, Yvette Graham, Jennifer Foster

Abstract:Event-Level Video Question Answering (EVQA) requires complex reasoning across video events to obtain the visual information needed to provide optimal answers. However, despite significant progress in model performance, few studies have focused on using the explicit semantic connections between the question and visual information, especially at the event level. There is a need to use such semantic connections to facilitate complex reasoning across video frames. Therefore, we propose a semantic-aware dynamic retrospective-prospective reasoning approach for video-based question answering. Specifically, we explicitly use the Semantic Role Labeling (SRL) structure of the question in the dynamic reasoning process, where we decide whether to move to the next frame based on which part of the SRL structure (agent, verb, patient, etc.) of the question is being focused on. We conduct experiments on a benchmark EVQA dataset, TrafficQA. Results show that our proposed approach achieves superior performance compared to previous state-of-the-art models. Our code will be made publicly available for research use.

Exploiting Rich Textual User-Product Context for Improving Sentiment Analysis

Dec 17, 2022

Chenyang Lyu, Linyi Yang, Yue Zhang, Yvette Graham, Jennifer Foster

Abstract:User and product information associated with a review is useful for sentiment polarity prediction. Typical approaches incorporating such information focus on modeling users and products as implicitly learned representation vectors. Most do not exploit the potential of historical reviews, or those that currently do require unnecessary modifications to model architecture or do not make full use of user/product associations. The contribution of this work is twofold: i) a method to explicitly employ historical reviews belonging to the same user/product to initialize representations, and ii) efficient incorporation of textual associations between users and products via a user-product cross-context module. Experiments on IMDb, Yelp-2013 and Yelp-2014 benchmarks show that our approach substantially outperforms previous state-of-the-art. Since we employ BERT-base as the encoder, we additionally provide experiments in which our approach performs well with Span-BERT and Longformer. Furthermore, experiments where the reviews of each user/product in the training data are downsampled demonstrate the effectiveness of our approach under a low-resource setting.
