Hoang Thanh-Tung

Improving the Robustness of Representation Misdirection for Large Language Model Unlearning

Jan 31, 2025

Dang Huu-Tien, Hoang Thanh-Tung, Le-Minh Nguyen, Naoya Inoue

Abstract: Representation Misdirection (RM) and its variants are established large language model (LLM) unlearning methods with state-of-the-art performance. In this paper, we show that RM methods inherently reduce models' robustness, causing them to misbehave even when a single non-adversarial forget-token appears in the retain-query. To understand the underlying causes, we reframe the unlearning process as backdoor attacks and defenses: forget-tokens act as backdoor triggers that, when activated in retain-queries, disrupt RM models' behaviors, similar to successful backdoor attacks. To mitigate this vulnerability, we propose Random Noise Augmentation (RNA), a model- and method-agnostic approach with theoretical guarantees for improving the robustness of RM methods. Extensive experiments demonstrate that RNA significantly improves the robustness of RM models while enhancing unlearning performance.

* 12 pages, 4 figures, 1 table

On Effects of Steering Latent Representation for Large Language Model Unlearning

Aug 12, 2024

Dang Huu-Tien, Trung-Tin Pham, Hoang Thanh-Tung, Naoya Inoue

Abstract: Representation Misdirection for Unlearning (RMU), which steers the model representation in an intermediate layer toward a target random representation, is an effective method for large language model (LLM) unlearning. Despite its high performance, the underlying cause and explanation remain underexplored. In this paper, we first theoretically demonstrate that steering forget representations in the intermediate layer reduces token confidence, causing LLMs to generate wrong or nonsensical responses. Second, we investigate how the coefficient influences the alignment of forget-sample representations with the random direction and hint at the optimal coefficient values for effective unlearning across different network layers. Third, we show that RMU unlearned models are robust against adversarial jailbreak attacks. Last, our empirical analysis shows that RMU is less effective when applied to the middle and later layers in LLMs. To resolve this drawback, we propose Adaptive RMU, a simple yet effective alternative method that makes unlearning effective with most layers. Extensive experiments demonstrate that Adaptive RMU significantly improves the unlearning performance compared to prior art while incurring no additional computational cost.

* 15 pages, 5 figures, 8 tables

Wicked Oddities: Selectively Poisoning for Effective Clean-Label Backdoor Attacks

Jul 16, 2024

Quang H. Nguyen, Nguyen Ngoc-Hieu, The-Anh Ta, Thanh Nguyen-Tang, Kok-Seng Wong, Hoang Thanh-Tung, Khoa D. Doan

Abstract: Deep neural networks are vulnerable to backdoor attacks, a type of adversarial attack that poisons the training data to manipulate the behavior of models trained on such data. Clean-label attacks are a stealthier form of backdoor attack that can be performed without changing the labels of poisoned data. Early works on clean-label attacks added triggers to a random subset of the training set, ignoring the fact that samples contribute unequally to the attack's success. This results in high poisoning rates and low attack success rates. To alleviate the problem, several supervised learning-based sample selection strategies have been proposed. However, these methods assume access to the entire labeled training set and require training, which is expensive and may not always be practical. This work studies a new and more practical (but also more challenging) threat model where the attacker only provides data for the target class (e.g., in face recognition systems) and has no knowledge of the victim model or any other classes in the training set. We study different strategies for selectively poisoning a small set of training samples in the target class to boost the attack success rate in this setting. Our threat model poses a serious threat in training machine learning models with third-party datasets, since the attack can be performed effectively with limited information. Experiments on benchmark datasets illustrate the effectiveness of our strategies in improving clean-label backdoor attacks.

A Cosine Similarity-based Method for Out-of-Distribution Detection

Jun 23, 2023

Nguyen Ngoc-Hieu, Nguyen Hung-Quang, The-Anh Ta, Thanh Nguyen-Tang, Khoa D Doan, Hoang Thanh-Tung

Abstract: The ability to detect out-of-distribution (OOD) data is a crucial aspect of practical machine learning applications. In this work, we show that cosine similarity between the test feature and the typical in-distribution (ID) feature is a good indicator of OOD data. We propose Class Typical Matching (CTM), a post hoc OOD detection algorithm that uses a cosine similarity scoring function. Extensive experiments on multiple benchmarks show that CTM outperforms existing post hoc OOD detection methods.

* Accepted paper at ICML 2023 Workshop on Spurious Correlations, Invariance, and Stability. 10 pages (4 main + appendix)
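
The scoring idea in this abstract can be illustrated with a minimal sketch. It is not the authors' CTM implementation: it assumes that the "typical ID feature" of a class is the mean of that class's training features, and that a test point is scored by its highest cosine similarity to any class mean. The function names `class_means` and `ood_score` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors
    return dot / (norm_u * norm_v)


def class_means(features, labels):
    """Assumed 'typical' feature per class: the mean training feature."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + xi for s, xi in zip(sums[y], x)]
        counts[y] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def ood_score(x, typical):
    """Higher means more in-distribution; threshold it to flag OOD inputs."""
    return max(cosine(x, m) for m in typical.values())


# Toy 2-D features for two classes (illustrative data only).
train_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_y = [0, 0, 1, 1]
typical = class_means(train_x, train_y)

in_dist = ood_score([1.0, 0.0], typical)
far_away = ood_score([-1.0, -1.0], typical)
print(in_dist > far_away)
```

In this toy setup the point aligned with a class mean scores close to 1, while the point facing away from both means scores below 0, so a threshold between the two would separate them.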

Class based Influence Functions for Error Detection

May 02, 2023

Thang Nguyen-Duc, Hoang Thanh-Tung, Quan Hung Tran, Dang Huu-Tien, Hieu Ngoc Nguyen, Anh T. V. Dau, Nghi D. Q. Bui

Abstract: Influence functions (IFs) are a powerful tool for detecting anomalous examples in large-scale datasets. However, they are unstable when applied to deep networks. In this paper, we provide an explanation for the instability of IFs and develop a solution to this problem. We show that IFs are unreliable when the two data points belong to two different classes. Our solution leverages class information to improve the stability of IFs. Extensive experiments show that our modification significantly improves the performance and stability of IFs while incurring no additional computational cost.

* Thang Nguyen-Duc, Hoang Thanh-Tung, and Quan Hung Tran are co-first authors of this paper. 12 pages, 12 figures. Accepted to ACL 2023

Towards Using Data-Centric Approach for Better Code Representation Learning

May 25, 2022

Anh Dau, Thang Nguyen-Duc, Hoang Thanh-Tung, Nghi Bui

Abstract: Despite the recent trend of creating source code models and applying them to software engineering tasks, the quality of such models is insufficient for real-world application. In this work, we focus on improving existing code learning models from the data-centric perspective instead of designing new source code models. We shed some light on this direction by using a so-called data-influence method to identify noisy samples of pre-trained code learning models. The data-influence method assesses the similarity of a target sample to the correct samples to determine whether or not the target sample is noisy. The results of our evaluation show that data-influence methods can identify noisy samples for the code classification and defect prediction tasks. We envision that the data-centric approach will be a key driver for developing source code models that are useful in practice.

Toward a Generalization Metric for Deep Generative Models

Nov 02, 2020

Hoang Thanh-Tung, Truyen Tran

Abstract: Measuring the generalization capacity of Deep Generative Models (DGMs) is difficult because of the curse of dimensionality. Evaluation metrics for DGMs like Inception Score, Frechet Inception Distance, Precision-Recall, and Neural Net Divergence (NND) try to estimate the distance between the generated distribution and the target distribution using a polynomial number of samples. These metrics are the target of researchers when designing new models. Despite the claims, it is still unclear how well they can measure the generalization capacity of a model. In this paper, we investigate the capacity of these metrics in measuring the generalization capacity. We introduce a framework for comparing the robustness of evaluation metrics. We show that better scores in these metrics do not imply better generalization. They can be fooled easily by a generator that memorizes a small subset of the training set. We propose a fix to the NND metric to make it more robust to noise in the generated data.

* 1st I Can't Believe It's Not Better Workshop (ICBINB@NeurIPS 2020). Source code is available at https://github.com/htt210/GeneralizationMetricGAN

Improving Generalization and Stability of Generative Adversarial Networks

Feb 11, 2019

Hoang Thanh-Tung, Truyen Tran, Svetha Venkatesh

Abstract: Generative Adversarial Networks (GANs) are one of the most popular tools for learning complex high-dimensional distributions. However, the generalization properties of GANs have not been well understood. In this paper, we analyze the generalization of GANs in practical settings. We show that discriminators trained on discrete datasets with the original GAN loss have poor generalization capability and do not approximate the theoretically optimal discriminator. We propose a zero-centered gradient penalty for improving the generalization of the discriminator by pushing it toward the optimal discriminator. The penalty guarantees the generalization and convergence of GANs. Experiments on synthetic and large-scale datasets verify our theoretical analysis.

* ICLR 2019

On catastrophic forgetting and mode collapse in Generative Adversarial Networks

Sep 12, 2018

Hoang Thanh-Tung, Truyen Tran, Svetha Venkatesh

Abstract: Generative Adversarial Networks (GANs) are one of the most prominent tools for learning complicated distributions. However, problems such as mode collapse and catastrophic forgetting prevent GANs from learning the target distribution. These problems are usually studied independently of each other. In this paper, we show that both problems are present in GANs and that their combined effect makes GAN training unstable. We also show that methods such as gradient penalties and momentum-based optimizers can improve the stability of GANs by effectively preventing these problems from happening. Finally, we study a mechanism by which mode collapse occurs and propagates in feedforward neural networks.

* Workshop on Theoretical Foundation and Applications of Deep Generative Models, Stockholm, Sweden, 2018
