Yiran Zhang

Reinforcing Multimodal Understanding and Generation with Dual Self-rewards

Jun 09, 2025

Jixiang Hong, Yiran Zhang, Guanzhong Wang, Yi Liu, Ji-Rong Wen, Rui Yan

Abstract: Building upon large language models (LLMs), recent large multimodal models (LMMs) unify cross-modal understanding and generation into a single framework. However, LMMs still struggle to achieve accurate image-text alignment and are prone to generating text responses that contradict the visual input or to failing to follow text-to-image prompts. Current solutions require external supervision (e.g., human feedback or reward models) and only address unidirectional tasks, either understanding or generation. In this work, based on the observation that understanding and generation are inverse dual tasks, we introduce a self-supervised dual reward mechanism to reinforce the understanding and generation capabilities of LMMs. Specifically, we sample multiple outputs for a given input in one task domain, then reverse the input-output pairs to compute the dual likelihood of the model as self-rewards for optimization. Extensive experimental results on visual understanding and generation benchmarks demonstrate that our method can effectively enhance the performance of the model without any external supervision, with especially large improvements in text-to-image tasks.
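
The sample-then-reverse step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `sample_fn` and `dual_log_likelihood` are hypothetical callables standing in for the model's sampling and its likelihood in the reverse task, the toy stand-ins below use word overlap instead of a real model, and the optimization step that would consume the rewards is omitted.

```python
from typing import Callable, List, Tuple


def dual_self_rewards(
    task_input: str,
    sample_fn: Callable[[str, int], List[str]],
    dual_log_likelihood: Callable[[str, str], float],
    num_samples: int = 4,
) -> List[Tuple[str, float]]:
    """Score sampled outputs by how well they explain the input in the reverse task."""
    candidates = sample_fn(task_input, num_samples)
    # Reverse each (input, output) pair: the output becomes the condition and
    # the original input becomes the target whose likelihood is the reward.
    scored = [(c, dual_log_likelihood(c, task_input)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins (hypothetical): a real system would query the multimodal model.
def toy_sampler(prompt: str, n: int) -> List[str]:
    pool = ["a red cube on a table", "a blue sphere", "a red cube", "a table"]
    return pool[:n]


def toy_dual_ll(output: str, original_input: str) -> float:
    # Word-overlap proxy for log p(original_input | output); always <= 0.
    a, b = set(output.split()), set(original_input.split())
    return len(a & b) / len(a | b) - 1.0


ranked = dual_self_rewards("a red cube on a table", toy_sampler, toy_dual_ll)
```

Here `ranked` lists the candidates from highest to lowest self-reward, so the candidate that best reconstructs the original input comes first.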

Origin Tracer: A Method for Detecting LoRA Fine-Tuning Origins in LLMs

May 26, 2025

Hongyu Liang, Yuting Zheng, Yihan Li, Yiran Zhang, Shiyu Liang

Abstract: As large language models (LLMs) continue to advance, their deployment often involves fine-tuning to enhance performance on specific downstream tasks. However, this customization is sometimes accompanied by misleading claims about the origins, raising significant concerns about transparency and trust within the open-source community. Existing model verification techniques typically assess functional, representational, and weight similarities. However, these approaches often struggle against obfuscation techniques, such as permutations and scaling transformations. To address this limitation, we propose a novel detection method, Origin-Tracer, that rigorously determines whether a model has been fine-tuned from a specified base model. This method includes the ability to extract the LoRA rank utilized during the fine-tuning process, providing a more robust verification framework. This framework is the first to provide a formalized approach specifically aimed at pinpointing the sources of model fine-tuning. We empirically validate our method on thirty-one diverse open-source models under conditions that simulate real-world obfuscation scenarios. We also analyze the effectiveness of our framework and discuss its limitations. The results demonstrate the effectiveness of our approach and indicate its potential to establish new benchmarks for model verification.

From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation

May 24, 2025

Zhihao Zhang, Yiran Zhang, Xiyue Zhou, Liting Huang, Imran Razzak, Preslav Nakov, Usman Naseem

Abstract: Infodemics and health misinformation have a significant negative impact on individuals and society, exacerbating confusion and increasing hesitancy in adopting recommended health measures. Recent advancements in generative AI, capable of producing realistic, human-like text and images, have significantly accelerated the spread and expanded the reach of health misinformation, resulting in an alarming surge in its dissemination. To combat infodemics, most existing work has focused on developing misinformation datasets from social media and fact-checking platforms, but has faced limitations in topical coverage, inclusion of AI generation, and accessibility of raw content. To address these issues, we present MM Health, a large-scale multimodal misinformation dataset in the health domain consisting of 34,746 news articles encompassing both textual and visual information. MM Health includes human-generated multimodal information (5,776 articles) and AI-generated multimodal information (28,880 articles) from various SOTA generative AI models. Additionally, we benchmarked our dataset against three tasks (reliability checks, originality checks, and fine-grained AI detection), demonstrating that existing SOTA models struggle to accurately distinguish the reliability and origin of information. Our dataset aims to support the development of misinformation detection across various health scenarios, facilitating the detection of human- and machine-generated content at multimodal levels.

* Preprint

Commenting Higher-level Code Unit: Full Code, Reduced Code, or Hierarchical Code Summarization

Mar 13, 2025

Weisong Sun, Yiran Zhang, Jie Zhu, Zhihui Wang, Chunrong Fang, Yonglong Zhang, Yebo Feng, Jiangping Huang, Xingya Wang, Zhi Jin (+1 more)

Abstract: Commenting code is a crucial activity in software development, as it facilitates future maintenance and updates. To enhance the efficiency of writing comments and reduce developers' workload, researchers have proposed various automated code summarization (ACS) techniques to automatically generate comments/summaries for given code units. However, these ACS techniques primarily focus on generating summaries for code units at the method level. There is a significant lack of research on summarizing higher-level code units, such as file-level and module-level code units, despite the fact that summaries of these higher-level code units are highly useful for quickly gaining a macro-level understanding of software components and architecture. To fill this gap, in this paper, we conduct a systematic study on how to use LLMs for commenting higher-level code units, including file level and module level. These higher-level units are significantly larger than method-level ones, which poses challenges in handling long code inputs within LLM constraints and maintaining efficiency. To address these issues, we explore various summarization strategies for ACS of higher-level code units, which can be divided into three types: full code summarization, reduced code summarization, and hierarchical code summarization. The experimental results suggest that for summarizing file-level code units, using the full code is the most effective approach, with reduced code serving as a cost-efficient alternative. However, for summarizing module-level code units, hierarchical code summarization becomes the most promising strategy. In addition, inspired by the research on method-level ACS, we also investigate using the LLM as an evaluator to evaluate the quality of summaries of higher-level code units. The experimental results demonstrate that the LLM's evaluation results strongly correlate with human evaluations.
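
As a rough illustration of the hierarchical strategy named in this abstract, the sketch below summarizes each file first and then summarizes the combined file summaries to obtain a module-level summary. It is only an assumption about how such a pipeline could be wired: `summarize` is a hypothetical callable standing in for an LLM call, and `first_line_summarizer` is a toy placeholder, not a technique from the paper.

```python
from typing import Callable, Dict


def summarize_module(files: Dict[str, str], summarize: Callable[[str], str]) -> str:
    """Summarize each file, then summarize the concatenated file summaries."""
    # Level 1: one summary per file, in a stable (sorted) order.
    file_summaries = {path: summarize(code) for path, code in sorted(files.items())}
    # Level 2: the module summary sees only the short file summaries,
    # which keeps the final input far smaller than the full module code.
    combined = "\n".join(f"{path}: {text}" for path, text in file_summaries.items())
    return summarize(combined)


def first_line_summarizer(text: str) -> str:
    # Hypothetical placeholder for an LLM call: keep the first non-empty line.
    for line in text.splitlines():
        if line.strip():
            return line.strip()[:60]
    return ""


module = {
    "b.py": "# parses config\nimport json\n",
    "a.py": "# entry point\nimport b\n",
}
module_summary = summarize_module(module, first_line_summarizer)
```

With a real model in place of the placeholder, the second-level call would receive a few lines per file rather than the full source, which is the point of the hierarchical design.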

TrustNavGPT: Modeling Uncertainty to Improve Trustworthiness of Audio-Guided LLM-Based Robot Navigation

Aug 03, 2024

Xingpeng Sun, Yiran Zhang, Xindi Tang, Amrit Singh Bedi, Aniket Bera

Abstract: While LLMs are proficient at processing text in human conversations, they often encounter difficulties with the nuances of verbal instructions and thus remain prone to hallucinating trust in human commands. In this work, we present TrustNavGPT, an LLM-based, audio-guided navigation agent that uses affective cues in spoken communication, elements such as tone and inflection that convey meaning beyond words, allowing it to assess the trustworthiness of human commands and make effective, safe decisions. We provide a lightweight yet effective approach that extends existing LLMs to model the vocal features embedded in the voice command and to model uncertainty for safe robotic navigation.

* IROS 2024

RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection

Jun 07, 2024

Liting Huang, Zhihao Zhang, Yiran Zhang, Xiyue Zhou, Shoujin Wang

Abstract: Recent advancements in generative AI models, which can create realistic and human-like content, are significantly transforming how people communicate, create, and work. While the appropriate use of generative AI models can benefit society, their misuse poses significant threats to data reliability and authentication. However, due to a lack of aligned multimodal datasets, effective and robust methods for detecting machine-generated content are still in the early stages of development. In this paper, we introduce RU-AI, a new large-scale multimodal dataset designed for the robust and efficient detection of machine-generated content in text, image, and voice. Our dataset is constructed from three large publicly available datasets, Flickr8K, COCO, and Places205, by combining the original datasets and their corresponding machine-generated pairs. Additionally, experimental results show that our proposed unified model, which incorporates a multimodal embedding module with a multilayer perceptron network, can effectively determine the origin of the data (i.e., original data samples or machine-generated ones) from RU-AI. However, future work is still required to address the remaining challenges posed by RU-AI. The source code and dataset are available at https://github.com/ZhihaoZhang97/RU-AI.

Game-MUG: Multimodal Oriented Game Situation Understanding and Commentary Generation Dataset

Apr 30, 2024

Zhihao Zhang, Feiqi Cao, Yingbin Mo, Yiran Zhang, Josiah Poon, Caren Han

Abstract: The dynamic nature of esports makes the game situation relatively complicated for average viewers. Esports broadcasting involves expert game casters, but caster-dependent commentary alone is not enough to fully understand the game situation. Commentary becomes richer when it includes diverse multimodal esports information, such as audience talk and emotions, game audio, and game match event information. This paper introduces GAME-MUG, a new multimodal game situation understanding and audience-engaged commentary generation dataset, together with a strong baseline. Our dataset is collected from 2020-2022 LOL game live streams from YouTube and Twitch, and includes multimodal esports game information, including text, audio, and time-series event logs, for detecting the game situation. In addition, we propose a new audience conversation augmented commentary dataset that covers game situation and audience conversation understanding, and we introduce a robust joint multimodal dual learning model as a baseline. We examine the model's game situation/event understanding ability and commentary generation capability to show the effectiveness of the multimodal aspects coverage and the joint integration learning approach.

FedRDMA: Communication-Efficient Cross-Silo Federated LLM via Chunked RDMA Transmission

Mar 01, 2024

Zeling Zhang, Dongqi Cai, Yiran Zhang, Mengwei Xu, Shangguang Wang, Ao Zhou

Abstract: Communication overhead is a significant bottleneck in federated learning (FL), and it has been exacerbated by the increasing size of AI models. In this paper, we propose FedRDMA, a communication-efficient cross-silo FL system that integrates RDMA into the FL communication protocol. To overcome the limitations of RDMA in wide-area networks (WANs), FedRDMA divides the updated model into chunks and designs a series of optimization techniques to improve the efficiency and robustness of RDMA-based communication. We implement FedRDMA atop the industrial federated learning framework and evaluate it on a real-world cross-silo FL scenario. The experimental results show that FedRDMA can achieve up to 3.8× speedup in communication efficiency compared to traditional TCP/IP-based FL systems.

* under review
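
The chunking idea in the FedRDMA abstract (dividing the updated model into chunks before transmission) can be sketched in isolation. The example below only illustrates splitting and order-independent reassembly of a byte payload; it involves no RDMA calls, and the chunk size, function names, and payload are hypothetical choices rather than details from the paper.

```python
from typing import Iterator, List, Tuple


def split_into_chunks(payload: bytes, chunk_size: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (index, chunk) pairs covering the payload in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for index, offset in enumerate(range(0, len(payload), chunk_size)):
        yield index, payload[offset:offset + chunk_size]


def reassemble(chunks: List[Tuple[int, bytes]]) -> bytes:
    """Rebuild the payload from chunks that may arrive in any order."""
    # Indices are unique, so sorting the pairs orders them by chunk index.
    return b"".join(data for _, data in sorted(chunks))


# 1024 bytes standing in for a serialized model update (hypothetical payload).
update = bytes(range(256)) * 4
chunks = list(split_into_chunks(update, chunk_size=300))
# Simulate out-of-order arrival by reversing the chunk list before reassembly.
restored = reassemble(list(reversed(chunks)))
```

Tagging each chunk with its index is what lets the receiver tolerate reordering or retransmit a single failed chunk instead of the whole update.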

Brain-Inspired Modelling and Decision-making for Human-Like Autonomous Driving in Mixed Traffic Environment

Jan 13, 2022

Peng Hang, Yiran Zhang, Chen Lv

Abstract: In this paper, a human-like driving framework is designed for autonomous vehicles (AVs), which aims to make AVs better integrate into the transportation ecology of human driving and to eliminate human drivers' misunderstanding of, and incompatibility with, autonomous driving. Based on the analysis of the real-world INTERACTION dataset, a driving aggressiveness estimation model is established with the fuzzy inference approach. Then, a human-like driving model, which integrates the brain emotional learning circuit model (BELCM) with the two-point preview model, is designed. In the human-like lane-change decision-making algorithm, the cost function is designed to comprehensively consider driving safety and travel efficiency. Based on the cost function and multiple constraints, the dynamic game algorithm is applied to model the interaction and decision making between the AV and the human driver. Additionally, to guarantee the lane-change safety of AVs, an artificial potential field model is built for collision risk assessment. Finally, the proposed algorithm is evaluated through human-in-the-loop experiments on a driving simulator, and the results demonstrate the feasibility and effectiveness of the proposed method.

* This work has been submitted to IEEE Transactions on Intelligent Transportation Systems

A Marker-free Head Tracker Using Vision-based Head Pose Estimation with Adaptive Kalman Filter

Mar 24, 2021

Zhongxu Hu, Chen Lv, Yanxin Zhou, Yiran Zhang, Wenhui Huang

Abstract: Immersion and interaction are important features of a driving simulator. To improve these characteristics, this paper proposes a low-cost, marker-less driver head tracking framework based on a head pose estimation model, which allows the view of the simulator to align automatically with the driver's head pose. The proposed method uses only an RGB camera, without any other hardware or markers. To handle the error of the head pose estimation model, this paper proposes an adaptive Kalman filter. Based on an analysis of the error distribution of the estimation model and of user experience, the proposed Kalman filter includes an adaptive observation noise coefficient and a loop closure module, which can adaptively moderate the smoothness of the curve and keep the curve stable near the initial position. The experiments show that the proposed method is feasible and that it can be used with different head pose estimation models.
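
To make the idea of an adaptive observation noise coefficient concrete, here is a minimal one-dimensional sketch. It is not the filter from the paper: the switching rule (trust measurements more when the innovation is large, smooth more when it is small), the threshold, and all parameter values are assumptions chosen for illustration, and the loop closure module is not modelled.

```python
from typing import Iterable, List, Optional


def adaptive_kalman_1d(
    measurements: Iterable[float],
    process_var: float = 1e-3,
    r_small_motion: float = 1.0,
    r_large_motion: float = 0.01,
    motion_threshold: float = 5.0,
) -> List[float]:
    """Filter a scalar signal with an observation noise that adapts per step."""
    estimates: List[float] = []
    x: Optional[float] = None  # state estimate (e.g. head yaw in degrees)
    p = 1.0                    # estimate variance
    for z in measurements:
        if x is None:
            x = z              # initialise from the first measurement
            estimates.append(x)
            continue
        p += process_var       # predict step for a constant-position model
        innovation = z - x
        # Assumed heuristic: large jumps are treated as real motion (low noise),
        # small deviations as estimator jitter (high noise, stronger smoothing).
        r = r_large_motion if abs(innovation) > motion_threshold else r_small_motion
        k = p / (p + r)        # Kalman gain
        x += k * innovation
        p *= 1.0 - k
        estimates.append(x)
    return estimates


# Jitter around 0 degrees followed by a quick head turn to about 20 degrees.
yaw = [0.0, 0.4, -0.3, 0.2, 20.0, 20.3, 19.8]
smoothed = adaptive_kalman_1d(yaw)
```

On this toy signal the early jitter is damped while the estimate follows the head turn within one step, which is the trade-off an adaptive noise coefficient is meant to balance.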
