Abstract: Large language models are typically trained by treating text as a single global distribution, which often results in geographically homogenized behavior. We study metadata conditioning as a lightweight approach to localization, pre-training 31 models (at 0.5B and 1B parameter scales) from scratch on large-scale English news data annotated with verified URLs, country tags, and continent tags, covering 4 continents and 17 countries. Across four controlled experiments, we show that metadata conditioning consistently improves in-region performance without sacrificing cross-region generalization, enables global models to recover localization comparable to region-specific models, and improves learning efficiency. Our ablation studies demonstrate that URL-level metadata alone captures much of the geographic signal, while balanced regional data coverage remains essential, as metadata cannot fully compensate for missing regions. Finally, we introduce a downstream benchmark of 800 localized news multiple-choice questions (MCQs) and show that, after instruction tuning, metadata-conditioned global models achieve accuracy comparable to LLaMA-3.2-1B-Instruct despite being trained on substantially less data. Together, these results establish metadata conditioning as a practical and compute-efficient approach to localizing language models.
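As a concrete illustration of the data-level mechanism, metadata conditioning typically amounts to prepending the annotations to each training document so the model can condition on them; the tag format, field names, and example URL below are assumptions for illustration, not the paper's exact serialization:
```python
# Minimal sketch of metadata conditioning for pre-training data.
# The tag format below is an assumption, not the paper's serialization.

def condition_document(text: str, url: str | None = None,
                       country: str | None = None,
                       continent: str | None = None) -> str:
    """Prepend verified metadata tags so the LM can condition on locale."""
    tags = []
    if url is not None:
        tags.append(f"<url>{url}</url>")
    if country is not None:
        tags.append(f"<country>{country}</country>")
    if continent is not None:
        tags.append(f"<continent>{continent}</continent>")
    return "".join(tags) + "\n" + text

example = condition_document(
    "Local elections were held across the state on Saturday...",
    url="example-news.au/politics",   # hypothetical URL
    country="Australia",
    continent="Oceania",
)
print(example)
```
At inference time, the same tags can be supplied (or omitted) to steer the model toward a region, which is what lets a single global model recover region-specific behavior.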
Abstract: Recent results on learning a language in the limit have shown that, although language identification is impossible, language generation is tractable. As this foundational area expands, we need to consider the implications of language generation in real-world settings. This work offers the first theoretical treatment of safe language generation. Building on the computational paradigm of learning in the limit, we formalize the tasks of safe language identification and safe language generation. We prove that under this model, safe language identification is impossible, and that safe language generation is at least as hard as (vanilla) language identification, which is itself impossible. Finally, we discuss several intractable and tractable cases.
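For readers unfamiliar with the underlying paradigm, the classical Gold-style definition that such formalizations build on can be stated as follows; this is the standard definition, and the paper's safe variants are not reproduced here:
```latex
% Classical identification in the limit (Gold, 1967), stated for context;
% the paper's "safe" variants extend this setup.
A learner $\mathcal{A}$ \emph{identifies} a language $L \in \mathcal{L}$
\emph{in the limit} if, for every enumeration $w_1, w_2, \ldots$ of $L$,
there is a step $t^{*}$ such that for all $t \ge t^{*}$:
\[
  \mathcal{A}(w_1, \ldots, w_t) = \mathcal{A}(w_1, \ldots, w_{t^{*}}),
  \qquad
  L\bigl(\mathcal{A}(w_1, \ldots, w_{t^{*}})\bigr) = L,
\]
where $L(\cdot)$ denotes the language generated by a hypothesis. Gold's
classic theorem shows that any class containing all finite languages and at
least one infinite language is not identifiable in the limit; this is the
impossibility that the abstract's hardness reductions bottom out in.
```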
Abstract: Despite being the 5th most spoken language, Bangla remains underrepresented in Large Language Models (LLMs), particularly for code generation. This primarily stems from the scarcity of high-quality data for pre-training and/or fine-tuning such models. We therefore introduce the first dedicated family of Code LLMs for Bangla (1B & 9B). We offer three major contributions: (1) comprehensive Bangla code-instruction datasets for programming domain adaptation; (2) MBPP-Bangla, an evaluation benchmark for Bangla code generation; and (3) the TigerCoder family of Code LLMs, achieving significant ~11-18% performance gains at Pass@1 over existing multilingual and general-purpose Bangla LLMs. Our findings show that curated, high-quality datasets can overcome the limitations of smaller models for low-resource languages. We open-source all resources to further advance Bangla LLM research.
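For reference, Pass@1 results such as these are conventionally computed with the unbiased pass@k estimator of Chen et al. (2021); a minimal sketch follows, where whether this exact estimator is used here is an assumption and the sample counts are illustrative:
```python
# Unbiased pass@k estimator commonly used for code-generation benchmarks
# (Chen et al., 2021). Whether TigerCoder uses exactly this setup is an
# assumption made for illustration.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated per problem, c = samples passing all tests."""
    if n - c < k:
        return 1.0  # cannot draw k samples that all fail
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 20 samples per problem, 7 of them correct.
print(f"pass@1 = {pass_at_k(20, 7, 1):.3f}")   # 0.350
```
With a single greedy sample per problem (n = 1, k = 1), Pass@1 reduces to plain accuracy over the benchmark's problems.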
Abstract: Natural language understanding systems struggle with low-resource languages, including many dialects of high-resource ones. Dialect-to-standard normalization addresses this issue by transforming dialectal text so that it can be used by standard-language tools downstream. In this study, we introduce a new normalization method that combines rule-based, linguistically informed transformations with large language models (LLMs) using targeted few-shot prompting, without requiring any parallel data. We implement our method for Greek dialects and apply it to a dataset of regional proverbs, evaluating the outputs with human annotators. We then use this dataset to conduct downstream experiments, finding that previous results on these proverbs relied solely on superficial linguistic information, including orthographic artifacts, whereas new observations can still be made from the remaining semantics.
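A minimal sketch of such a rules-then-LLM hybrid pipeline, where the substitution rules and the `llm_complete` client are illustrative placeholders rather than the paper's actual materials:
```python
import re

# Hypothetical dialect-to-standard substitution rules; the paper's
# linguistically informed rules for Greek dialects are not reproduced here.
RULES = [
    (re.compile(r"\bτση\b"), "της"),   # illustrative dialectal genitive article
    (re.compile(r"ντζ"), "τζ"),        # illustrative cluster simplification
]

FEW_SHOT = (
    "Normalize the dialectal sentence into Standard Modern Greek.\n"
    "Dialect: <example dialect sentence>\nStandard: <example standard sentence>\n"
)

def normalize(sentence: str, llm_complete) -> str:
    """Rules first, then targeted few-shot LLM cleanup; no parallel data."""
    for pattern, repl in RULES:
        sentence = pattern.sub(repl, sentence)
    prompt = FEW_SHOT + f"Dialect: {sentence}\nStandard:"
    return llm_complete(prompt).strip()  # llm_complete: placeholder LLM client
```
The design point is that deterministic rules handle the systematic, well-understood phenomena, leaving only the residual variation for the LLM's few-shot step.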

Abstract: While bias in large language models (LLMs) is well studied, similar concerns in vision-language models (VLMs) have received comparatively less attention. Existing VLM bias studies often focus on portrait-style images and gender-occupation associations, overlooking broader and more complex social stereotypes and their implied harms. This work introduces VIGNETTE, a large-scale VQA benchmark with 30M+ images for evaluating bias in VLMs through a question-answering framework spanning four directions: factuality, perception, stereotyping, and decision making. Going beyond narrowly scoped studies, we assess how VLMs interpret identities in contextualized settings, revealing how models make trait and capability assumptions and exhibit patterns of discrimination. Drawing from social psychology, we examine how VLMs connect visual identity cues to trait- and role-based inferences, encoding social hierarchies through biased selections. Our findings uncover subtle, multifaceted, and surprising stereotypical patterns, offering insights into how VLMs construct social meaning from their inputs.
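A sketch of how per-direction scoring might work in such a QA framework; the item schema and the `vlm_answer` call are illustrative assumptions, not the benchmark's actual interface:
```python
from collections import defaultdict

# The four evaluation directions come from the abstract; everything else
# (item fields, vlm_answer) is an illustrative assumption.
DIRECTIONS = ("factuality", "perception", "stereotyping", "decision_making")

def evaluate(items, vlm_answer):
    """items: dicts with 'image', 'question', 'direction', 'unbiased_answer'."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item in items:
        pred = vlm_answer(item["image"], item["question"])  # placeholder VLM call
        totals[item["direction"]] += 1
        hits[item["direction"]] += int(pred == item["unbiased_answer"])
    return {d: hits[d] / totals[d] for d in DIRECTIONS if totals[d]}
```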

Abstract: When a student fails an exam, do we tend to blame their effort or the test's difficulty? Attribution, defined as how reasons are assigned to event outcomes, shapes perceptions, reinforces stereotypes, and influences decisions. Attribution Theory in social psychology explains how humans assign responsibility for events using implicit cognition, attributing causes to internal factors (e.g., effort, ability) or external ones (e.g., task difficulty, luck). How LLMs attribute event outcomes based on demographics carries important fairness implications. Most work exploring social biases in LLMs focuses on surface-level associations or isolated stereotypes. This work proposes a cognitively grounded bias evaluation framework to identify how disparities in models' reasoning channel biases toward demographic groups.
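A sketch of what such a cognitively grounded probe could look like: vary only the demographic cue and compare the rate of internal versus external attributions. The template, group labels, and scoring below are illustrative assumptions, not the paper's materials:
```python
# Illustrative Attribution-Theory probe: the same outcome is presented for
# different demographic cues, and we record whether the model attributes it
# to internal (effort, ability) or external (difficulty, luck) causes.
TEMPLATE = "{person} failed the exam. Why? Answer 'internal' or 'external'."
GROUPS = ["A student", "An immigrant student", "A wealthy student"]  # illustrative

def attribution_rates(llm, n_samples: int = 50) -> dict[str, float]:
    """llm: placeholder callable mapping a prompt to a completion string."""
    rates = {}
    for group in GROUPS:
        internal = sum(
            llm(TEMPLATE.format(person=group)).strip().lower() == "internal"
            for _ in range(n_samples)
        )
        rates[group] = internal / n_samples  # share of internal attributions
    return rates  # disparities across groups signal attribution bias
```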
Abstract: This paper describes the GMU systems for the IWSLT 2025 low-resource speech translation shared task. We trained systems for all language pairs except Levantine Arabic. We fine-tuned SeamlessM4T-v2 for automatic speech recognition (ASR), machine translation (MT), and end-to-end speech translation (E2E ST). The ASR and MT models are also combined to form cascaded ST systems. Additionally, we explored various training paradigms for E2E ST fine-tuning, including direct E2E fine-tuning, multi-task training, and parameter initialization using components from fine-tuned ASR and/or MT models. Our results show that (1) direct E2E fine-tuning yields strong results; (2) initializing with a fine-tuned ASR encoder improves ST performance on languages SeamlessM4T-v2 was not trained on; and (3) multi-task training is slightly helpful.
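A minimal sketch of the cascaded configuration, with the `transcribe`/`translate` interfaces as generic placeholders rather than the actual SeamlessM4T-v2 API:
```python
# Cascaded speech translation: the fine-tuned ASR model's transcript feeds
# the fine-tuned MT model. Both interfaces below are placeholders.
def cascaded_st(audio, asr_model, mt_model,
                src_lang: str, tgt_lang: str) -> str:
    transcript = asr_model.transcribe(audio, lang=src_lang)        # ASR stage
    return mt_model.translate(transcript, src=src_lang, tgt=tgt_lang)  # MT stage
```
The E2E alternative replaces the two stages with a single speech-to-text-translation model; the abstract's finding (2) amounts to initializing that model's encoder from the ASR stage before fine-tuning.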
Abstract: Multilingual alignment of sentence representations has mostly required bitexts to bridge the gap between languages. We investigate whether visual information can bridge this gap instead. Image-caption datasets are easy to create without multilingual expertise, so they offer a more efficient alternative for low-resource languages. We find that multilingual image-caption alignment can implicitly align text representations across languages; that languages unseen by the encoder in pretraining can be incorporated into this alignment post-hoc; and that these aligned representations are usable for cross-lingual Natural Language Understanding (NLU) and bitext retrieval.
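One standard way to realize such image-caption alignment is a CLIP-style symmetric InfoNCE objective over (image, caption) pairs; whether this exact loss is what the paper trains with is an assumption:
```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (image, caption) pairs.

    Captions in different languages paired with the same images pull the
    text encoder toward a shared space: the implicit cross-lingual
    alignment effect the abstract studies.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature   # pairwise similarities
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```
Because the image side is language-agnostic, a new language's captions can be aligned post-hoc against the frozen image encoder, matching the abstract's "unseen languages" finding.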
Abstract: Python is one of the most commonly used programming languages in industry and education. Its English keywords and built-in functions/modules allow it to approach pseudo-code in readability and ease of writing. However, those who do not speak English may not enjoy these advantages; in fact, the English nature of Python's terms can hinder their ability to understand its code by adding a layer of overhead. To that end, we introduce the task of automatically translating Python's natural modality (keywords, error types, identifiers, etc.) into other human languages. This presents a unique challenge, given the abbreviated nature of these forms and the potential untranslatability of advanced mathematical/programming concepts across languages. We therefore create an automated pipeline to translate Python into other human languages, comparing strategies based on machine translation and large language models. We then use this pipeline to acquire translations of terms from five common Python libraries (pytorch, pandas, tensorflow, numpy, and random) in seven languages, and conduct a quality assessment on a subset of these terms in French, Greek, and Bengali. We hope this work provides a clearer path toward a universal Python that is accessible to anyone regardless of nationality or language background.
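A toy sketch of the mapping step such a pipeline enables, using a handful of illustrative French renderings (in the paper, the translations come from MT/LLMs) and Python's own tokenizer so string literals and comments stay untouched:
```python
import io
import tokenize

# Illustrative French renderings of a few Python keywords; the actual
# translations produced by the paper's pipeline may differ.
FR_TO_PY = {"si": "if", "sinon": "else", "pour": "for", "dans": "in",
            "tant_que": "while", "définir": "def", "retourner": "return"}

def localize_to_python(source: str) -> str:
    """Map translated keywords back to standard Python, token by token."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        text = (FR_TO_PY.get(tok.string, tok.string)
                if tok.type == tokenize.NAME else tok.string)
        out.append((tok.type, text))
    return tokenize.untokenize(out)

print(localize_to_python(
    "pour x dans range(3):\n    si x > 1:\n        print(x)\n"))
```
Working at the token level rather than with plain string replacement is what avoids corrupting identifiers, strings, and comments that happen to contain a translated keyword.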

Abstract: There has been little systematic study of how dialectal differences affect toxicity detection by modern LLMs. Furthermore, although using LLMs as evaluators ("LLM-as-a-judge") is a growing research area, their sensitivity to dialectal nuances remains underexplored and requires more focused attention. In this paper, we address these gaps through a comprehensive toxicity evaluation of LLMs across diverse dialects. We create a multi-dialect dataset through synthetic transformations and human-assisted translations, covering 10 language clusters and 60 varieties. We then evaluate three LLMs on the consistency of their toxicity assessments along three axes: multilingual, dialectal, and LLM-human. Our findings show that LLMs handle both multilingual and dialectal variation with reasonable sensitivity; however, ranking the three axes, LLM-human agreement is the weakest, followed by dialectal consistency. Code repository: https://github.com/ffaisal93/dialect_toxicity_llm_judge
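One way to quantify the three consistency axes is pairwise agreement over aligned items; a sketch follows, with the exact metrics used in the paper as an assumption:
```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Illustrative consistency scoring; the paper's exact metrics may differ.
# labels_by_variant maps each dialect variant to toxicity labels over the
# same aligned items.
def dialectal_consistency(labels_by_variant: dict[str, list[int]]) -> float:
    """Mean pairwise Cohen's kappa across dialect variants of the same items."""
    kappas = [cohen_kappa_score(labels_by_variant[a], labels_by_variant[b])
              for a, b in combinations(labels_by_variant, 2)]
    return sum(kappas) / len(kappas)

def llm_human_agreement(llm_labels: list[int],
                        human_labels: list[int]) -> float:
    """LLM-human axis: agreement between judge labels and human annotations."""
    return cohen_kappa_score(llm_labels, human_labels)
```
The abstract's ranking then corresponds to comparing these scores: under this scheme, `llm_human_agreement` would come out lowest, followed by `dialectal_consistency`.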