Jingbiao Mei

Improved Fine-Tuning of Large Multimodal Models for Hateful Meme Detection

Feb 18, 2025

Jingbiao Mei, Jinghong Chen, Guangyu Yang, Weizhe Lin, Bill Byrne

Abstract:Hateful memes have become a significant concern on the Internet, necessitating robust automated detection systems. While large multimodal models have shown strong generalization across various tasks, they exhibit poor generalization to hateful meme detection due to the dynamic nature of memes tied to emerging social trends and breaking news. Recent work further highlights the limitations of conventional supervised fine-tuning for large multimodal models in this context. To address these challenges, we propose Large Multimodal Model Retrieval-Guided Contrastive Learning (LMM-RGCL), a novel two-stage fine-tuning framework designed to improve both in-domain accuracy and cross-domain generalization. Experimental results on six widely used meme classification datasets demonstrate that LMM-RGCL achieves state-of-the-art performance, outperforming agent-based systems such as VPD-PALI-X-55B. Furthermore, our method effectively generalizes to out-of-domain memes under low-resource settings, surpassing models like GPT-4o.

* Preprint. Under Review

On Extending Direct Preference Optimization to Accommodate Ties

Sep 25, 2024

Jinghong Chen, Guangyu Yang, Weizhe Lin, Jingbiao Mei, Bill Byrne

Abstract:We derive and investigate two DPO variants that explicitly model the possibility of declaring a tie in pair-wise comparisons. We replace the Bradley-Terry model in DPO with two well-known modeling extensions, by Rao and Kupper and by Davidson, that assign probability to ties as alternatives to clear preferences. Our experiments in neural machine translation and summarization show that explicitly labeled ties can be added to the datasets for these DPO variants without the degradation in task performance that is observed when the same tied pairs are presented to DPO. We find empirically that the inclusion of ties leads to stronger regularization with respect to the reference policy as measured by KL divergence, and we see this even for DPO in its original form. These findings motivate and enable the inclusion of tied pairs in preference optimization as opposed to simply discarding them.

* 24 pages
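
To make the tie-aware preference models in the abstract above concrete, here is a minimal sketch of the Rao-Kupper and Davidson outcome probabilities in their standard textbook forms, written in terms of the usual DPO reward margin d = β[(log πθ(y_w|x) − log πref(y_w|x)) − (log πθ(y_l|x) − log πref(y_l|x))]. The function names, the tie-strength parameter name `nu`, and the `pair_nll` helper are assumptions for illustration; the paper's exact parameterization and loss may differ.

```python
import math


def bradley_terry(d):
    """P(first response preferred) under Bradley-Terry, given reward margin d."""
    return 1.0 / (1.0 + math.exp(-d))


def rao_kupper(d, nu):
    """(P(win), P(tie), P(loss)) under Rao-Kupper; nu >= 1, nu == 1 allows no ties."""
    win = 1.0 / (1.0 + nu * math.exp(-d))
    loss = 1.0 / (1.0 + nu * math.exp(d))
    tie = (nu * nu - 1.0) * win * loss
    return win, tie, loss


def davidson(d, nu):
    """(P(win), P(tie), P(loss)) under Davidson; nu >= 0, nu == 0 allows no ties."""
    tie_weight = nu * math.exp(-d / 2.0)
    z = 1.0 + math.exp(-d) + tie_weight
    return 1.0 / z, tie_weight / z, math.exp(-d) / z


def pair_nll(probs, label):
    """Negative log-likelihood of one labeled pair; label is 'win' or 'tie'."""
    win, tie, _ = probs
    return -math.log(win if label == "win" else tie)


if __name__ == "__main__":
    for margin in (0.0, 1.0, 3.0):
        print(margin, rao_kupper(margin, nu=2.0), davidson(margin, nu=1.0))
```

Both models reduce to Bradley-Terry when ties are switched off (`nu = 1` for Rao-Kupper, `nu = 0` for Davidson), and both assign the most tie probability when the reward margin is zero.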

Control-DAG: Constrained Decoding for Non-Autoregressive Directed Acyclic T5 using Weighted Finite State Automata

Apr 10, 2024

Jinghong Chen, Weizhe Lin, Jingbiao Mei, Bill Byrne

Abstract:The Directed Acyclic Transformer is a fast non-autoregressive (NAR) model that performs well in Neural Machine Translation. Two issues prevent its application to general Natural Language Generation (NLG) tasks: frequent Out-Of-Vocabulary (OOV) errors and the inability to faithfully generate entity names. We introduce Control-DAG, a constrained decoding algorithm for our Directed Acyclic T5 (DA-T5) model which offers lexical, vocabulary and length control. We show that Control-DAG significantly enhances DA-T5 on the Schema Guided Dialogue and the DART datasets, establishing strong NAR results for Task-Oriented Dialogue and Data-to-Text NLG.

* 11 pages. NAACL 2024

PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers

Feb 13, 2024

Weizhe Lin, Jingbiao Mei, Jinghong Chen, Bill Byrne

Abstract:Large Multimodal Models (LMMs) excel in natural language and visual understanding but are challenged by exacting tasks such as Knowledge-based Visual Question Answering (KB-VQA) which involve the retrieval of relevant information from document collections to use in shaping answers to questions. We present an extensive training and evaluation framework, M2KR, for KB-VQA. M2KR contains a collection of vision and language tasks which we have incorporated into a single suite of benchmark tasks for training and evaluating general-purpose multi-modal retrievers. We use M2KR to develop PreFLMR, a pre-trained version of the recently developed Fine-grained Late-interaction Multi-modal Retriever (FLMR) approach to KB-VQA, and we report new state-of-the-art results across a range of tasks. We also present investigations into the scaling behaviors of PreFLMR intended to be useful in future developments in general-purpose multi-modal retrievers.

* 8 pages

Improving hateful memes detection via learning hatefulness-aware embedding space through retrieval-guided contrastive learning

Nov 14, 2023

Jingbiao Mei, Jinghong Chen, Weizhe Lin, Bill Byrne, Marcus Tomalin

Abstract:Hateful memes have emerged as a significant concern on the Internet. These memes, which are a combination of image and text, often convey messages vastly different from their individual meanings. Thus, detecting hateful memes requires the system to jointly understand the visual and textual modalities. However, our investigation reveals that the embedding space of existing CLIP-based systems lacks sensitivity to subtle differences in memes that are vital for correct hatefulness classification. To address this issue, we propose constructing a hatefulness-aware embedding space through retrieval-guided contrastive training. Specifically, we add an auxiliary loss that utilizes hard negative and pseudo-gold samples to train the embedding space. Our approach achieves state-of-the-art performance on the HatefulMemes dataset with an AUROC of 86.7. Notably, our approach outperforms much larger fine-tuned Large Multimodal Models like Flamingo and LLaVA. Finally, we demonstrate a retrieval-based hateful memes detection system, which is capable of making hatefulness classification based on data unseen in training from a database. This allows developers to update the hateful memes detection system by simply adding new data without retraining, a desirable feature for real services in the constantly-evolving landscape of hateful memes on the Internet.
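
The auxiliary loss described in the abstract above can be illustrated with a small sketch: for an anchor meme, retrieve the most similar same-label example (treated here as the pseudo-gold positive) and the most similar opposite-label example (the hard negative), then apply a softmax-style contrastive loss over the two similarities. Everything below is an assumption for illustration, including the function names, the cosine similarity, the temperature, and the exact loss form; it is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(anchor_idx, embeddings, labels, same_label):
    """Return (index, similarity) of the nearest example with the same or the opposite label."""
    best_idx, best_sim = None, -math.inf
    for i, (emb, label) in enumerate(zip(embeddings, labels)):
        if i == anchor_idx or (label == labels[anchor_idx]) != same_label:
            continue
        sim = cosine(embeddings[anchor_idx], emb)
        if sim > best_sim:
            best_idx, best_sim = i, sim
    return best_idx, best_sim


def contrastive_loss(anchor_idx, embeddings, labels, temperature=0.1):
    """Low when the pseudo-gold positive is closer to the anchor than the hard negative."""
    _, pos_sim = retrieve(anchor_idx, embeddings, labels, same_label=True)
    _, neg_sim = retrieve(anchor_idx, embeddings, labels, same_label=False)
    pos = math.exp(pos_sim / temperature)
    neg = math.exp(neg_sim / temperature)
    return -math.log(pos / (pos + neg))


if __name__ == "__main__":
    labels = [1, 1, 0]  # 1 = hateful, 0 = benign (toy data)
    separated = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    entangled = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
    print(contrastive_loss(0, separated, labels))
    print(contrastive_loss(0, entangled, labels))
```

In the toy data, the loss is near zero when same-label memes sit close together and large when the nearest neighbour of the anchor carries the opposite label, which is the pressure a hatefulness-aware embedding space needs.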

Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering

Sep 29, 2023

Weizhe Lin, Jinghong Chen, Jingbiao Mei, Alexandru Coca, Bill Byrne

Abstract:Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from existing knowledge bases to answer visually-grounded questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong framework to tackle KB-VQA, first retrieves related documents with Dense Passage Retrieval (DPR) and then uses them to answer questions. This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major limitations in RA-VQA's retriever: (1) the image representations obtained via image-to-text transforms can be incomplete and inaccurate and (2) relevance scores between queries and documents are computed with one-dimensional embeddings, which can be insensitive to finer-grained relevance. FLMR overcomes these limitations by obtaining image representations that complement those from the image-to-text transforms using a vision model aligned with an existing text-based retriever through a simple alignment network. FLMR also encodes images and questions using multi-dimensional embeddings to capture finer-grained relevance between queries and documents. FLMR significantly improves the original RA-VQA retriever's PRRecall@5 by approximately 8%. Finally, we equipped RA-VQA with two state-of-the-art large multi-modal/language models to achieve a VQA score of approximately 61% on the OK-VQA dataset.

* To appear at NeurIPS 2023. This is a submission version, and the camera-ready version will be updated soon
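
The contrast the abstract above draws between single-vector and multi-vector relevance scoring can be sketched as follows. This is a minimal illustration that assumes ColBERT-style late interaction (each query token embedding is matched to its best document token embedding and the maxima are summed) and reads "one-dimensional embeddings" as one vector per query or document. Mean pooling is used here only as a stand-in for a single-vector encoder; the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_tokens, doc_tokens):
    """Sum over query token vectors of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def single_vector_score(query_tokens, doc_tokens):
    """Collapse each side to one mean vector, then take a single inner product."""
    def mean(vectors):
        return [sum(column) / len(vectors) for column in zip(*vectors)]
    return dot(mean(query_tokens), mean(doc_tokens))


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]
    doc_a = [[1.0, 0.0], [0.0, 1.0]]  # matches each query token exactly
    doc_b = [[0.5, 0.5], [0.5, 0.5]]  # matches only on average
    print(late_interaction_score(query, doc_a), late_interaction_score(query, doc_b))
    print(single_vector_score(query, doc_a), single_vector_score(query, doc_b))
```

In this toy case the pooled single-vector score cannot tell the two documents apart, while the token-level score ranks the exact match higher.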

BayesFT: Bayesian Optimization for Fault Tolerant Neural Network Architecture

Sep 30, 2022

Nanyang Ye, Jingbiao Mei, Zhicheng Fang, Yuwen Zhang, Ziqing Zhang, Huaying Wu, Xiaoyao Liang

Abstract:To deploy deep learning algorithms in resource-limited scenarios, an emerging device, resistive random access memory (ReRAM), is regarded as promising because it supports analog computing. However, the practicality of ReRAM is limited primarily by weight drifting in ReRAM neural networks, which arises from multiple factors, including manufacturing and thermal noise. In this paper, we propose a novel Bayesian optimization method for fault tolerant neural network architecture (BayesFT). For search space design, instead of conducting neural architecture search over the whole feasible search space, we first systematically explore the weight drifting tolerance of different neural network components, such as dropout, normalization, number of layers, and activation functions, and find that dropout improves robustness to weight drifting. Based on this analysis, we propose an efficient search space that searches only over the dropout rate of each layer. Then, we use Bayesian optimization to search for the optimal neural architecture robust to weight drifting. Empirical experiments demonstrate that our algorithmic framework outperforms state-of-the-art methods by up to 10 times on various tasks, such as image classification and object detection.
