Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Borui Yang

Hallucination Detection in Large Language Models with Metamorphic Relations

Feb 20, 2025

Borui Yang, Md Afif Al Mamun, Jie M. Zhang, Gias Uddin

Abstract:Large Language Models (LLMs) are prone to hallucinations, e.g., factually incorrect information, in their responses. These hallucinations present challenges for LLM-based applications that demand high factual accuracy. Existing hallucination detection methods primarily depend on external resources, which can suffer from issues such as low availability, incomplete coverage, privacy concerns, high latency, low reliability, and poor scalability. There are also methods depending on output probabilities, which are often inaccessible for closed-source LLMs like GPT models. This paper presents MetaQA, a self-contained hallucination detection approach that leverages metamorphic relation and prompt mutation. Unlike existing methods, MetaQA operates without any external resources and is compatible with both open-source and closed-source LLMs. MetaQA is based on the hypothesis that if an LLM's response is a hallucination, the designed metamorphic relations will be violated. We compare MetaQA with the state-of-the-art zero-resource hallucination detection method, SelfCheckGPT, across multiple datasets, and on two open-source and two closed-source LLMs. Our results reveal that MetaQA outperforms SelfCheckGPT in terms of precision, recall, and f1 score. For the four LLMs we study, MetaQA outperforms SelfCheckGPT with a superiority margin ranging from 0.041 - 0.113 (for precision), 0.143 - 0.430 (for recall), and 0.154 - 0.368 (for F1-score). For instance, with Mistral-7B, MetaQA achieves an average F1-score of 0.435, compared to SelfCheckGPT's F1-score of 0.205, representing an improvement rate of 112.2%. MetaQA also demonstrates superiority across all different categories of questions.

* 21 pages, 11 figures

Via

Access Paper or Ask Questions

Hufu: A Modality-Agnositc Watermarking System for Pre-Trained Transformers via Permutation Equivariance

Mar 09, 2024

Hengyuan Xu, Liyao Xiang, Xingjun Ma, Borui Yang, Baochun Li

Figure 1 for Hufu: A Modality-Agnositc Watermarking System for Pre-Trained Transformers via Permutation Equivariance

Figure 2 for Hufu: A Modality-Agnositc Watermarking System for Pre-Trained Transformers via Permutation Equivariance

Figure 3 for Hufu: A Modality-Agnositc Watermarking System for Pre-Trained Transformers via Permutation Equivariance

Figure 4 for Hufu: A Modality-Agnositc Watermarking System for Pre-Trained Transformers via Permutation Equivariance

Abstract:With the blossom of deep learning models and services, it has become an imperative concern to safeguard the valuable model parameters from being stolen. Watermarking is considered an important tool for ownership verification. However, current watermarking schemes are customized for different models and tasks, hard to be integrated as an integrated intellectual protection service. We propose Hufu, a modality-agnostic watermarking system for pre-trained Transformer-based models, relying on the permutation equivariance property of Transformers. Hufu embeds watermark by fine-tuning the pre-trained model on a set of data samples specifically permuted, and the embedded model essentially contains two sets of weights -- one for normal use and the other for watermark extraction which is triggered on permuted inputs. The permutation equivariance ensures minimal interference between these two sets of model weights and thus high fidelity on downstream tasks. Since our method only depends on the model itself, it is naturally modality-agnostic, task-independent, and trigger-sample-free. Extensive experiments on the state-of-the-art vision Transformers, BERT, and GPT2 have demonstrated Hufu's superiority in meeting watermarking requirements including effectiveness, efficiency, fidelity, and robustness, showing its great potential to be deployed as a uniform ownership verification service for various Transformers.

Via

Access Paper or Ask Questions

Energy Attack: On Transferring Adversarial Examples

Sep 09, 2021

Ruoxi Shi, Borui Yang, Yangzhou Jiang, Chenglong Zhao, Bingbing Ni

Figure 1 for Energy Attack: On Transferring Adversarial Examples

Figure 2 for Energy Attack: On Transferring Adversarial Examples

Figure 3 for Energy Attack: On Transferring Adversarial Examples

Figure 4 for Energy Attack: On Transferring Adversarial Examples

Abstract:In this work we propose Energy Attack, a transfer-based black-box $L_\infty$-adversarial attack. The attack is parameter-free and does not require gradient approximation. In particular, we first obtain white-box adversarial perturbations of a surrogate model and divide these perturbations into small patches. Then we extract the unit component vectors and eigenvalues of these patches with principal component analysis (PCA). Base on the eigenvalues, we can model the energy distribution of adversarial perturbations. We then perform black-box attacks by sampling from the perturbation patches according to their energy distribution, and tiling the sampled patches to form a full-size adversarial perturbation. This can be done without the available access to victim models. Extensive experiments well demonstrate that the proposed Energy Attack achieves state-of-the-art performance in black-box attacks on various models and several datasets. Moreover, the extracted distribution is able to transfer among different model architectures and different datasets, and is therefore intrinsic to vision architectures.

* Under Review for AAAI-22

Via

Access Paper or Ask Questions