Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Rémi Lebret

WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts

Jun 18, 2025

Negar Foroutan, Angelika Romanou, Matin Ansaripour, Julian Martin Eisenschlos, Karl Aberer, Rémi Lebret

Abstract:Documents are fundamental to preserving and disseminating information, often incorporating complex layouts, tables, and charts that pose significant challenges for automatic document understanding (DU). While vision-language large models (VLLMs) have demonstrated improvements across various tasks, their effectiveness in processing long-context vision inputs remains unclear. This paper introduces WikiMixQA, a benchmark comprising 1,000 multiple-choice questions (MCQs) designed to evaluate cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages spanning seven distinct topics. Unlike existing benchmarks, WikiMixQA emphasizes complex reasoning by requiring models to synthesize information from multiple modalities. We evaluate 12 state-of-the-art vision-language models, revealing that while proprietary models achieve ~70% accuracy when provided with direct context, their performance deteriorates significantly when retrieval from long documents is required. Among these, GPT-4-o is the only model exceeding 50% accuracy in this setting, whereas open-source models perform considerably worse, with a maximum accuracy of 27%. These findings underscore the challenges of long-context, multi-modal reasoning and establish WikiMixQA as a crucial benchmark for advancing document understanding research.

* ACL 2025 (Findings)

Via

Access Paper or Ask Questions

Stance Detection on Social Media with Fine-Tuned Large Language Models

Apr 18, 2024

İlker Gül, Rémi Lebret, Karl Aberer

Abstract:Stance detection, a key task in natural language processing, determines an author's viewpoint based on textual analysis. This study evaluates the evolution of stance detection methods, transitioning from early machine learning approaches to the groundbreaking BERT model, and eventually to modern Large Language Models (LLMs) such as ChatGPT, LLaMa-2, and Mistral-7B. While ChatGPT's closed-source nature and associated costs present challenges, the open-source models like LLaMa-2 and Mistral-7B offers an encouraging alternative. Initially, our research focused on fine-tuning ChatGPT, LLaMa-2, and Mistral-7B using several publicly available datasets. Subsequently, to provide a comprehensive comparison, we assess the performance of these models in zero-shot and few-shot learning scenarios. The results underscore the exceptional ability of LLMs in accurately detecting stance, with all tested models surpassing existing benchmarks. Notably, LLaMa-2 and Mistral-7B demonstrate remarkable efficiency and potential for stance detection, despite their smaller sizes compared to ChatGPT. This study emphasizes the potential of LLMs in stance detection and calls for more extensive research in this field.

Via

Access Paper or Ask Questions

Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages

Jun 29, 2023

Yasmine Karoui, Rémi Lebret, Negar Foroutan, Karl Aberer

Figure 1 for Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages

Figure 2 for Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages

Figure 3 for Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages

Figure 4 for Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages

Abstract:Vision-Language Pre-training (VLP) has advanced the performance of many vision-language tasks, such as image-text retrieval, visual entailment, and visual reasoning. The pre-training mostly utilizes lexical databases and image queries in English. Previous work has demonstrated that the pre-training in English does not transfer well to other languages in a zero-shot setting. However, multilingual pre-trained language models (MPLM) have excelled at a variety of single-modal language tasks. In this paper, we propose a simple yet efficient approach to adapt VLP to unseen languages using MPLM. We utilize a cross-lingual contextualized token embeddings alignment approach to train text encoders for non-English languages. Our approach does not require image input and primarily uses machine translation, eliminating the need for target language data. Our evaluation across three distinct tasks (image-text retrieval, visual entailment, and natural language visual reasoning) demonstrates that this approach outperforms the state-of-the-art multilingual vision-language models without requiring large parallel corpora. Our code is available at https://github.com/Yasminekaroui/CliCoTea.

* Accepted to ACL 2023 as short paper

Via

Access Paper or Ask Questions

Revisiting Offline Compression: Going Beyond Factorization-based Methods for Transformer Language Models

Feb 08, 2023

Mohammadreza Banaei, Klaudia Bałazy, Artur Kasymov, Rémi Lebret, Jacek Tabor, Karl Aberer

Abstract:Recent transformer language models achieve outstanding results in many natural language processing (NLP) tasks. However, their enormous size often makes them impractical on memory-constrained devices, requiring practitioners to compress them to smaller networks. In this paper, we explore offline compression methods, meaning computationally-cheap approaches that do not require further fine-tuning of the compressed model. We challenge the classical matrix factorization methods by proposing a novel, better-performing autoencoder-based framework. We perform a comprehensive ablation study of our approach, examining its different aspects over a diverse set of evaluation settings. Moreover, we show that enabling collaboration between modules across layers by compressing certain modules together positively impacts the final model performance. Experiments on various NLP tasks demonstrate that our approach significantly outperforms commonly used factorization-based offline compression methods.

Via

Access Paper or Ask Questions

An Efficient Active Learning Pipeline for Legal Text Classification

Nov 15, 2022

Sepideh Mamooler, Rémi Lebret, Stéphane Massonnet, Karl Aberer

Abstract:Active Learning (AL) is a powerful tool for learning with less labeled data, in particular, for specialized domains, like legal documents, where unlabeled data is abundant, but the annotation requires domain expertise and is thus expensive. Recent works have shown the effectiveness of AL strategies for pre-trained language models. However, most AL strategies require a set of labeled samples to start with, which is expensive to acquire. In addition, pre-trained language models have been shown unstable during fine-tuning with small datasets, and their embeddings are not semantically meaningful. In this work, we propose a pipeline for effectively using active learning with pre-trained language models in the legal domain. To this end, we leverage the available unlabeled data in three phases. First, we continue pre-training the model to adapt it to the downstream task. Second, we use knowledge distillation to guide the model's embeddings to a semantically meaningful space. Finally, we propose a simple, yet effective, strategy to find the initial set of labeled samples with fewer actions compared to existing methods. Our experiments on Contract-NLI, adapted to the classification task, and LEDGAR benchmarks show that our approach outperforms standard AL strategies, and is more efficient. Furthermore, our pipeline reaches comparable results to the fully-supervised approach with a small performance gap, and dramatically reduced annotation cost. Code and the adapted data will be made available.

Via

Access Paper or Ask Questions

AdaGrid: Adaptive Grid Search for Link Prediction Training Objective

Mar 30, 2022

Tim Poštuvan, Jiaxuan You, Mohammadreza Banaei, Rémi Lebret, Jure Leskovec

Figure 1 for AdaGrid: Adaptive Grid Search for Link Prediction Training Objective

Figure 2 for AdaGrid: Adaptive Grid Search for Link Prediction Training Objective

Figure 3 for AdaGrid: Adaptive Grid Search for Link Prediction Training Objective

Figure 4 for AdaGrid: Adaptive Grid Search for Link Prediction Training Objective

Abstract:One of the most important factors that contribute to the success of a machine learning model is a good training objective. Training objective crucially influences the model's performance and generalization capabilities. This paper specifically focuses on graph neural network training objective for link prediction, which has not been explored in the existing literature. Here, the training objective includes, among others, a negative sampling strategy, and various hyperparameters, such as edge message ratio which controls how training edges are used. Commonly, these hyperparameters are fine-tuned by complete grid search, which is very time-consuming and model-dependent. To mitigate these limitations, we propose Adaptive Grid Search (AdaGrid), which dynamically adjusts the edge message ratio during training. It is model agnostic and highly scalable with a fully customizable computational budget. Through extensive experiments, we show that AdaGrid can boost the performance of the models up to $1.9\%$ while being nine times more time-efficient than a complete search. Overall, AdaGrid represents an effective automated algorithm for designing machine learning training objectives.

Via

Access Paper or Ask Questions

Legal Transformer Models May Not Always Help

Sep 15, 2021

Saibo Geng, Rémi Lebret, Karl Aberer

Figure 1 for Legal Transformer Models May Not Always Help

Figure 2 for Legal Transformer Models May Not Always Help

Figure 3 for Legal Transformer Models May Not Always Help

Figure 4 for Legal Transformer Models May Not Always Help

Abstract:Deep learning-based Natural Language Processing methods, especially transformers, have achieved impressive performance in the last few years. Applying those state-of-the-art NLP methods to legal activities to automate or simplify some simple work is of great value. This work investigates the value of domain adaptive pre-training and language adapters in legal NLP tasks. By comparing the performance of language models with domain adaptive pre-training on different tasks and different dataset splits, we show that domain adaptive pre-training is only helpful with low-resource downstream tasks, thus far from being a panacea. We also benchmark the performance of adapters in a typical legal NLP task and show that they can yield similar performance to full model tuning with much smaller training costs. As an additional result, we release LegalRoBERTa, a RoBERTa model further pre-trained on legal corpora.

Via

Access Paper or Ask Questions

Direction is what you need: Improving Word Embedding Compression in Large Language Models

Jun 15, 2021

Klaudia Bałazy, Mohammadreza Banaei, Rémi Lebret, Jacek Tabor, Karl Aberer

Figure 1 for Direction is what you need: Improving Word Embedding Compression in Large Language Models

Figure 2 for Direction is what you need: Improving Word Embedding Compression in Large Language Models

Figure 3 for Direction is what you need: Improving Word Embedding Compression in Large Language Models

Figure 4 for Direction is what you need: Improving Word Embedding Compression in Large Language Models

Abstract:The adoption of Transformer-based models in natural language processing (NLP) has led to great success using a massive number of parameters. However, due to deployment constraints in edge devices, there has been a rising interest in the compression of these models to improve their inference time and memory footprint. This paper presents a novel loss objective to compress token embeddings in the Transformer-based models by leveraging an AutoEncoder architecture. More specifically, we emphasize the importance of the direction of compressed embeddings with respect to original uncompressed embeddings. The proposed method is task-agnostic and does not require further language modeling pre-training. Our method significantly outperforms the commonly used SVD-based matrix-factorization approach in terms of initial language model Perplexity. Moreover, we evaluate our proposed approach over SQuAD v1.1 dataset and several downstream tasks from the GLUE benchmark, where we also outperform the baseline in most scenarios. Our code is public.

Via

Access Paper or Ask Questions

Spoken dialect identification in Twitter using a multi-filter architecture

Jun 05, 2020

Mohammadreza Banaei, Rémi Lebret, Karl Aberer

Figure 1 for Spoken dialect identification in Twitter using a multi-filter architecture

Figure 2 for Spoken dialect identification in Twitter using a multi-filter architecture

Figure 3 for Spoken dialect identification in Twitter using a multi-filter architecture

Figure 4 for Spoken dialect identification in Twitter using a multi-filter architecture

Abstract:This paper presents our approach for SwissText & KONVENS 2020 shared task 2, which is a multi-stage neural model for Swiss German (GSW) identification on Twitter. Our model outputs either GSW or non-GSW and is not meant to be used as a generic language identifier. Our architecture consists of two independent filters where the first one favors recall, and the second one filter favors precision (both towards GSW). Moreover, we do not use binary models (GSW vs. not-GSW) in our filters but rather a multi-class classifier with GSW being one of the possible labels. Our model reaches F1-score of 0.982 on the test set of the shared task.

Via

Access Paper or Ask Questions

Upgrading the Newsroom: An Automated Image Selection System for News Articles

Apr 23, 2020

Fangyu Liu, Rémi Lebret, Didier Orel, Philippe Sordet, Karl Aberer

Figure 1 for Upgrading the Newsroom: An Automated Image Selection System for News Articles

Figure 2 for Upgrading the Newsroom: An Automated Image Selection System for News Articles

Figure 3 for Upgrading the Newsroom: An Automated Image Selection System for News Articles

Figure 4 for Upgrading the Newsroom: An Automated Image Selection System for News Articles

Abstract:We propose an automated image selection system to assist photo editors in selecting suitable images for news articles. The system fuses multiple textual sources extracted from news articles and accepts multilingual inputs. It is equipped with char-level word embeddings to help both modeling morphologically rich languages, e.g. German, and transferring knowledge across nearby languages. The text encoder adopts a hierarchical self-attention mechanism to attend more to both keywords within a piece of text and informative components of a news article. We extensively experiment with our system on a large-scale text-image database containing multimodal multilingual news articles collected from Swiss local news media websites. The system is compared with multiple baselines with ablation studies and is shown to beat existing text-image retrieval methods in a weakly-supervised learning setting. Besides, we also offer insights on the advantage of using multiple textual sources and multilingual data.

* Accepted to ACM Transactions on Multimedia Computing Communications and Applications (ACM TOMM)

Via

Access Paper or Ask Questions