Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yongxiu Xu

Does Unification Come at a Cost? Uni-SafeBench: A Safety Benchmark for Unified Multimodal Large Models

Apr 01, 2026

Zixiang Peng, Yongxiu Xu, Qinyi Zhang, Jiexun Shen, Yifan Zhang, Hongbo Xu, Yubin Wang, Gaopeng Gou

Abstract:Unified Multimodal Large Models (UMLMs) integrate understanding and generation capabilities within a single architecture. While this architectural unification, driven by the deep fusion of multimodal features, enhances model performance, it also introduces important yet underexplored safety challenges. Existing safety benchmarks predominantly focus on isolated understanding or generation tasks, failing to evaluate the holistic safety of UMLMs when handling diverse tasks under a unified framework. To address this, we introduce Uni-SafeBench, a comprehensive benchmark featuring a taxonomy of six major safety categories across seven task types. To ensure rigorous assessment, we develop Uni-Judger, a framework that effectively decouples contextual safety from intrinsic safety. Based on comprehensive evaluations across Uni-SafeBench, we uncover that while the unification process enhances model capabilities, it significantly degrades the inherent safety of the underlying LLM. Furthermore, open-source UMLMs exhibit much lower safety performance than multimodal large models specialized for either generation or understanding tasks. We open-source all resources to systematically expose these risks and foster safer AGI development.

Via

Access Paper or Ask Questions

REMOTE: A Unified Multimodal Relation Extraction Framework with Multilevel Optimal Transport and Mixture-of-Experts

Sep 05, 2025

Xinkui Lin, Yongxiu Xu, Minghao Tang, Shilong Zhang, Hongbo Xu, Hao Xu, Yubin Wang

Figure 1 for REMOTE: A Unified Multimodal Relation Extraction Framework with Multilevel Optimal Transport and Mixture-of-Experts

Figure 2 for REMOTE: A Unified Multimodal Relation Extraction Framework with Multilevel Optimal Transport and Mixture-of-Experts

Figure 3 for REMOTE: A Unified Multimodal Relation Extraction Framework with Multilevel Optimal Transport and Mixture-of-Experts

Figure 4 for REMOTE: A Unified Multimodal Relation Extraction Framework with Multilevel Optimal Transport and Mixture-of-Experts

Abstract:Multimodal relation extraction (MRE) is a crucial task in the fields of Knowledge Graph and Multimedia, playing a pivotal role in multimodal knowledge graph construction. However, existing methods are typically limited to extracting a single type of relational triplet, which restricts their ability to extract triplets beyond the specified types. Directly combining these methods fails to capture dynamic cross-modal interactions and introduces significant computational redundancy. Therefore, we propose a novel \textit{unified multimodal Relation Extraction framework with Multilevel Optimal Transport and mixture-of-Experts}, termed REMOTE, which can simultaneously extract intra-modal and inter-modal relations between textual entities and visual objects. To dynamically select optimal interaction features for different types of relational triplets, we introduce mixture-of-experts mechanism, ensuring the most relevant modality information is utilized. Additionally, considering that the inherent property of multilayer sequential encoding in existing encoders often leads to the loss of low-level information, we adopt a multilevel optimal transport fusion module to preserve low-level features while maintaining multilayer encoding, yielding more expressive representations. Correspondingly, we also create a Unified Multimodal Relation Extraction (UMRE) dataset to evaluate the effectiveness of our framework, encompassing diverse cases where the head and tail entities can originate from either text or image. Extensive experiments show that REMOTE effectively extracts various types of relational triplets and achieves state-of-the-art performanc on almost all metrics across two other public MRE datasets. We release our resources at https://github.com/Nikol-coder/REMOTE.

* ACM MM 2025

Via

Access Paper or Ask Questions

MMIDR: Teaching Large Language Model to Interpret Multimodal Misinformation via Knowledge Distillation

Mar 21, 2024

Longzheng Wang, Xiaohan Xu, Lei Zhang, Jiarui Lu, Yongxiu Xu, Hongbo Xu, Chuang Zhang

Figure 1 for MMIDR: Teaching Large Language Model to Interpret Multimodal Misinformation via Knowledge Distillation

Figure 2 for MMIDR: Teaching Large Language Model to Interpret Multimodal Misinformation via Knowledge Distillation

Figure 3 for MMIDR: Teaching Large Language Model to Interpret Multimodal Misinformation via Knowledge Distillation

Figure 4 for MMIDR: Teaching Large Language Model to Interpret Multimodal Misinformation via Knowledge Distillation

Abstract:Automatic detection of multimodal misinformation has gained a widespread attention recently. However, the potential of powerful Large Language Models (LLMs) for multimodal misinformation detection remains underexplored. Besides, how to teach LLMs to interpret multimodal misinformation in cost-effective and accessible way is still an open question. To address that, we propose MMIDR, a framework designed to teach LLMs in providing fluent and high-quality textual explanations for their decision-making process of multimodal misinformation. To convert multimodal misinformation into an appropriate instruction-following format, we present a data augmentation perspective and pipeline. This pipeline consists of a visual information processing module and an evidence retrieval module. Subsequently, we prompt the proprietary LLMs with processed contents to extract rationales for interpreting the authenticity of multimodal misinformation. Furthermore, we design an efficient knowledge distillation approach to distill the capability of proprietary LLMs in explaining multimodal misinformation into open-source LLMs. To explore several research questions regarding the performance of LLMs in multimodal misinformation detection tasks, we construct an instruction-following multimodal misinformation dataset and conduct comprehensive experiments. The experimental findings reveal that our MMIDR exhibits sufficient detection performance and possesses the capacity to provide compelling rationales to support its assessments.

* 10 pages, 3 figures

Via

Access Paper or Ask Questions

S2F-NER: Exploring Sequence-to-Forest Generation for Complex Entity Recognition

Oct 29, 2023

Yongxiu Xu, Heyan Huang, Yue Hu

Abstract:Named Entity Recognition (NER) remains challenging due to the complex entities, like nested, overlapping, and discontinuous entities. Existing approaches, such as sequence-to-sequence (Seq2Seq) generation and span-based classification, have shown impressive performance on various NER subtasks, but they are difficult to scale to datasets with longer input text because of either exposure bias issue or inefficient computation. In this paper, we propose a novel Sequence-to-Forest generation paradigm, S2F-NER, which can directly extract entities in sentence via a Forest decoder that decode multiple entities in parallel rather than sequentially. Specifically, our model generate each path of each tree in forest autoregressively, where the maximum depth of each tree is three (which is the shortest feasible length for complex NER and is far smaller than the decoding length of Seq2Seq). Based on this novel paradigm, our model can elegantly mitigates the exposure bias problem and keep the simplicity of Seq2Seq. Experimental results show that our model significantly outperforms the baselines on three discontinuous NER datasets and on two nested NER datasets, especially for discontinuous entity recognition.

Via

Access Paper or Ask Questions

A Boundary Offset Prediction Network for Named Entity Recognition

Oct 23, 2023

Minghao Tang, Yongquan He, Yongxiu Xu, Hongbo Xu, Wenyuan Zhang, Yang Lin

Abstract:Named entity recognition (NER) is a fundamental task in natural language processing that aims to identify and classify named entities in text. However, span-based methods for NER typically assign entity types to text spans, resulting in an imbalanced sample space and neglecting the connections between non-entity and entity spans. To address these issues, we propose a novel approach for NER, named the Boundary Offset Prediction Network (BOPN), which predicts the boundary offsets between candidate spans and their nearest entity spans. By leveraging the guiding semantics of boundary offsets, BOPN establishes connections between non-entity and entity spans, enabling non-entity spans to function as additional positive samples for entity detection. Furthermore, our method integrates entity type and span representations to generate type-aware boundary offsets instead of using entity types as detection targets. We conduct experiments on eight widely-used NER datasets, and the results demonstrate that our proposed BOPN outperforms previous state-of-the-art methods.

* Accepted by Findings of EMNLP 2023, 13 pages

Via

Access Paper or Ask Questions

Learning to Correct Noisy Labels for Fine-Grained Entity Typing via Co-Prediction Prompt Tuning

Oct 23, 2023

Minghao Tang, Yongquan He, Yongxiu Xu, Hongbo Xu, Wenyuan Zhang, Yang Lin

Figure 1 for Learning to Correct Noisy Labels for Fine-Grained Entity Typing via Co-Prediction Prompt Tuning

Figure 2 for Learning to Correct Noisy Labels for Fine-Grained Entity Typing via Co-Prediction Prompt Tuning

Figure 3 for Learning to Correct Noisy Labels for Fine-Grained Entity Typing via Co-Prediction Prompt Tuning

Figure 4 for Learning to Correct Noisy Labels for Fine-Grained Entity Typing via Co-Prediction Prompt Tuning

Abstract:Fine-grained entity typing (FET) is an essential task in natural language processing that aims to assign semantic types to entities in text. However, FET poses a major challenge known as the noise labeling problem, whereby current methods rely on estimating noise distribution to identify noisy labels but are confused by diverse noise distribution deviation. To address this limitation, we introduce Co-Prediction Prompt Tuning for noise correction in FET, which leverages multiple prediction results to identify and correct noisy labels. Specifically, we integrate prediction results to recall labeled labels and utilize a differentiated margin to identify inaccurate labels. Moreover, we design an optimization objective concerning divergent co-predictions during fine-tuning, ensuring that the model captures sufficient information and maintains robustness in noise identification. Experimental results on three widely-used FET datasets demonstrate that our noise correction approach significantly enhances the quality of various types of training samples, including those annotated using distant supervision, ChatGPT, and crowdsourcing.

* Accepted by Findings of EMNLP 2023, 11 pages

Via

Access Paper or Ask Questions