Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zhongqi Wang

REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models

Mar 20, 2025

Jie Zhang, Zheng Yuan, Zhongqi Wang, Bei Yan, Sibo Wang, Xiangkui Cao, Zonghui Guo, Shiguang Shan, Xilin Chen

Abstract:The rapid evolution of Large Vision-Language Models (LVLMs) has highlighted the necessity for comprehensive evaluation frameworks that assess these models across diverse dimensions. While existing benchmarks focus on specific aspects such as perceptual abilities, cognitive capabilities, and safety against adversarial attacks, they often lack the breadth and depth required to provide a holistic understanding of LVLMs' strengths and limitations. To address this gap, we introduce REVAL, a comprehensive benchmark designed to evaluate the \textbf{RE}liability and \textbf{VAL}ue of LVLMs. REVAL encompasses over 144K image-text Visual Question Answering (VQA) samples, structured into two primary sections: Reliability, which assesses truthfulness (\eg, perceptual accuracy and hallucination tendencies) and robustness (\eg, resilience to adversarial attacks, typographic attacks, and image corruption), and Values, which evaluates ethical concerns (\eg, bias and moral understanding), safety issues (\eg, toxicity and jailbreak vulnerabilities), and privacy problems (\eg, privacy awareness and privacy leakage). We evaluate 26 models, including mainstream open-source LVLMs and prominent closed-source models like GPT-4o and Gemini-1.5-Pro. Our findings reveal that while current LVLMs excel in perceptual tasks and toxicity avoidance, they exhibit significant vulnerabilities in adversarial scenarios, privacy preservation, and ethical reasoning. These insights underscore critical areas for future improvements, guiding the development of more secure, reliable, and ethically aligned LVLMs. REVAL provides a robust framework for researchers to systematically assess and compare LVLMs, fostering advancements in the field.

* 45 pages, 5 figures, 18 tables

Via

Access Paper or Ask Questions

Complementary Subspace Low-Rank Adaptation of Vision-Language Models for Few-Shot Classification

Jan 25, 2025

Zhongqi Wang, Jia Dai, Kai Li, Xu Li, Yanmeng Guo, Maosheng Xiang

Figure 1 for Complementary Subspace Low-Rank Adaptation of Vision-Language Models for Few-Shot Classification

Figure 2 for Complementary Subspace Low-Rank Adaptation of Vision-Language Models for Few-Shot Classification

Figure 3 for Complementary Subspace Low-Rank Adaptation of Vision-Language Models for Few-Shot Classification

Figure 4 for Complementary Subspace Low-Rank Adaptation of Vision-Language Models for Few-Shot Classification

Abstract:Vision language model (VLM) has been designed for large scale image-text alignment as a pretrained foundation model. For downstream few shot classification tasks, parameter efficient fine-tuning (PEFT) VLM has gained much popularity in the computer vision community. PEFT methods like prompt tuning and linear adapter have been studied for fine-tuning VLM while low rank adaptation (LoRA) algorithm has rarely been considered for few shot fine-tuning VLM. The main obstacle to use LoRA for few shot fine-tuning is the catastrophic forgetting problem. Because the visual language alignment knowledge is important for the generality in few shot learning, whereas low rank adaptation interferes with the most informative direction of the pretrained weight matrix. We propose the complementary subspace low rank adaptation (Comp-LoRA) method to regularize the catastrophic forgetting problem in few shot VLM finetuning. In detail, we optimize the low rank matrix in the complementary subspace, thus preserving the general vision language alignment ability of VLM when learning the novel few shot information. We conduct comparison experiments of the proposed Comp-LoRA method and other PEFT methods on fine-tuning VLM for few shot classification. And we also present the suppression on the catastrophic forgetting problem of our proposed method against directly applying LoRA to VLM. The results show that the proposed method surpasses the baseline method by about +1.0\% Top-1 accuracy and preserves the VLM zero-shot performance over the baseline method by about +1.3\% Top-1 accuracy.

* Preprint version

Via

Access Paper or Ask Questions

T2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models

Jul 05, 2024

Zhongqi Wang, Jie Zhang, Shiguang Shan, Xilin Chen

Abstract:While text-to-image diffusion models demonstrate impressive generation capabilities, they also exhibit vulnerability to backdoor attacks, which involve the manipulation of model outputs through malicious triggers. In this paper, for the first time, we propose a comprehensive defense method named T2IShield to detect, localize, and mitigate such attacks. Specifically, we find the "Assimilation Phenomenon" on the cross-attention maps caused by the backdoor trigger. Based on this key insight, we propose two effective backdoor detection methods: Frobenius Norm Threshold Truncation and Covariance Discriminant Analysis. Besides, we introduce a binary-search approach to localize the trigger within a backdoor sample and assess the efficacy of existing concept editing methods in mitigating backdoor attacks. Empirical evaluations on two advanced backdoor attack scenarios show the effectiveness of our proposed defense method. For backdoor sample detection, T2IShield achieves a detection F1 score of 88.9$\%$ with low computational cost. Furthermore, T2IShield achieves a localization F1 score of 86.4$\%$ and invalidates 99$\%$ poisoned samples. Codes are released at https://github.com/Robin-WZQ/T2IShield.

* Accepted by ECCV2024

Via

Access Paper or Ask Questions

Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of LVLMs

Jun 27, 2024

Jie Zhang, Zhongqi Wang, Mengqi Lei, Zheng Yuan, Bei Yan, Shiguang Shan, Xilin Chen

Abstract:Currently many benchmarks have been proposed to evaluate the perception ability of the Large Vision-Language Models (LVLMs). However, most benchmarks conduct questions by selecting images from existing datasets, resulting in the potential data leakage. Besides, these benchmarks merely focus on evaluating LVLMs on the realistic style images and clean scenarios, leaving the multi-stylized images and noisy scenarios unexplored. In response to these challenges, we propose a dynamic and scalable benchmark named Dysca for evaluating LVLMs by leveraging synthesis images. Specifically, we leverage Stable Diffusion and design a rule-based method to dynamically generate novel images, questions and the corresponding answers. We consider 51 kinds of image styles and evaluate the perception capability in 20 subtasks. Moreover, we conduct evaluations under 4 scenarios (i.e., Clean, Corruption, Print Attacking and Adversarial Attacking) and 3 question types (i.e., Multi-choices, True-or-false and Free-form). Thanks to the generative paradigm, Dysca serves as a scalable benchmark for easily adding new subtasks and scenarios. A total of 8 advanced open-source LVLMs with 10 checkpoints are evaluated on Dysca, revealing the drawbacks of current LVLMs. The benchmark is released in \url{https://github.com/Benchmark-Dysca/Dysca}.

Via

Access Paper or Ask Questions

Plug-and-Play Algorithm Convergence Analysis From The Standpoint of Stochastic Differential Equation

Apr 22, 2024

Zhongqi Wang, Bingnan Wang, Maosheng Xiang

Figure 1 for Plug-and-Play Algorithm Convergence Analysis From The Standpoint of Stochastic Differential Equation

Figure 2 for Plug-and-Play Algorithm Convergence Analysis From The Standpoint of Stochastic Differential Equation

Figure 3 for Plug-and-Play Algorithm Convergence Analysis From The Standpoint of Stochastic Differential Equation

Figure 4 for Plug-and-Play Algorithm Convergence Analysis From The Standpoint of Stochastic Differential Equation

Abstract:The Plug-and-Play (PnP) algorithm is popular for inverse image problem-solving. However, this algorithm lacks theoretical analysis of its convergence with more advanced plug-in denoisers. We demonstrate that discrete PnP iteration can be described by a continuous stochastic differential equation (SDE). We can also achieve this transformation through Markov process formulation of PnP. Then, we can take a higher standpoint of PnP algorithms from stochastic differential equations, and give a unified framework for the convergence property of PnP according to the solvability condition of its corresponding SDE. We reveal that a much weaker condition, bounded denoiser with Lipschitz continuous measurement function would be enough for its convergence guarantee, instead of previous Lipschitz continuous denoiser condition.

* 17pages, Preprint, Under review

Via

Access Paper or Ask Questions

CCLAP: Controllable Chinese Landscape Painting Generation via Latent Diffusion Model

Apr 22, 2023

Zhongqi Wang, Jie Zhang, Zhilong Ji, Jinfeng Bai, Shiguang Shan

Figure 1 for CCLAP: Controllable Chinese Landscape Painting Generation via Latent Diffusion Model

Figure 2 for CCLAP: Controllable Chinese Landscape Painting Generation via Latent Diffusion Model

Figure 3 for CCLAP: Controllable Chinese Landscape Painting Generation via Latent Diffusion Model

Figure 4 for CCLAP: Controllable Chinese Landscape Painting Generation via Latent Diffusion Model

Abstract:With the development of deep generative models, recent years have seen great success of Chinese landscape painting generation. However, few works focus on controllable Chinese landscape painting generation due to the lack of data and limited modeling capabilities. In this work, we propose a controllable Chinese landscape painting generation method named CCLAP, which can generate painting with specific content and style based on Latent Diffusion Model. Specifically, it consists of two cascaded modules, i.e., content generator and style aggregator. The content generator module guarantees the content of generated paintings specific to the input text. While the style aggregator module is to generate paintings of a style corresponding to a reference image. Moreover, a new dataset of Chinese landscape paintings named CLAP is collected for comprehensive evaluation. Both the qualitative and quantitative results demonstrate that our method achieves state-of-the-art performance, especially in artfully-composed and artistic conception. Codes are available at https://github.com/Robin-WZQ/CCLAP.

* 8 pages,13 figures

Via

Access Paper or Ask Questions