Ryo Hase

Smoothed Embeddings for Robust Language Models

Jan 27, 2025

Ryo Hase, Md Rafi Ur Rashid, Ashley Lewis, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang

Abstract: Improving the safety and reliability of large language models (LLMs) is a crucial aspect of realizing trustworthy AI systems. Although alignment methods aim to suppress harmful content generation, LLMs are often still vulnerable to jailbreaking attacks, which use adversarial inputs to subvert alignment and induce harmful outputs. We propose the Randomized Embedding Smoothing and Token Aggregation (RESTA) defense, which adds random noise to the embedding vectors and performs aggregation during the generation of each output token, with the aim of better preserving semantic information. Our experiments demonstrate that our approach achieves a better robustness-utility tradeoff than the baseline defenses.

* Presented in the Safe Generative AI Workshop at NeurIPS 2024
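
The idea in the abstract (perturb the embeddings with random noise, then aggregate over the noisy copies for each output token) can be illustrated with a minimal sketch. This is not the authors' implementation: the Gaussian noise, the majority-vote aggregation rule, and the names `smoothed_next_token`, `generate`, `next_token_logits`, `sigma`, and `num_samples` are all assumptions made for illustration, and `next_token_logits` stands in for a real language model.

```python
from collections import Counter

import numpy as np


def smoothed_next_token(embeddings, next_token_logits, sigma=0.1,
                        num_samples=8, rng=None):
    """Pick one output token by voting over noisy copies of the embeddings.

    `next_token_logits` is a hypothetical callable mapping an array of
    shape (seq_len, dim) to a vector of logits over the vocabulary.
    Gaussian noise and majority voting are assumptions of this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    votes = []
    for _ in range(num_samples):
        # Add independent random noise to every embedding vector.
        noisy = embeddings + sigma * rng.standard_normal(embeddings.shape)
        votes.append(int(np.argmax(next_token_logits(noisy))))
    # Aggregate: return the token predicted most often across the samples.
    return Counter(votes).most_common(1)[0][0]


def generate(prompt_ids, embedding_table, next_token_logits,
             max_new_tokens, **smoothing_kwargs):
    """Greedy decoding in which every new token is chosen by smoothing."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        embeddings = embedding_table[ids]  # look up current token embeddings
        ids.append(smoothed_next_token(embeddings, next_token_logits,
                                       **smoothing_kwargs))
    return ids
```

Repeating the noise-and-vote step for each generated token, rather than once for the whole response, is what ties the aggregation to individual output tokens in this sketch. With `sigma=0.0` it reduces to ordinary greedy decoding.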

Variational Randomized Smoothing for Sample-Wise Adversarial Robustness

Jul 16, 2024

Ryo Hase, Ye Wang, Toshiaki Koike-Akino, Jing Liu, Kieran Parsons

Abstract: Randomized smoothing is a defensive technique for improving robustness against adversarial examples, which are small input perturbations that degrade the performance of neural network models. Conventional randomized smoothing adds random noise with a fixed noise level to every input sample to smooth out adversarial perturbations. This paper proposes a new variational framework that uses a per-sample noise level suited to each input by introducing a noise level selector. Our experimental results demonstrate improved empirical robustness against adversarial attacks. We also provide and analyze the certified robustness of our sample-wise smoothing method.

* 20 pages, under preparation
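
The contrast between a fixed noise level and a per-sample noise level can be sketched as follows. This is an illustrative sketch only, not the paper's method: it covers prediction time and says nothing about how the selector is trained or how certification works. The Gaussian noise, the majority vote, and the names `smoothed_predict`, `sample_wise_smoothed_predict`, `classifier`, and `noise_level_selector` are assumptions introduced here.

```python
import numpy as np


def smoothed_predict(x, classifier, sigma, num_samples=100,
                     num_classes=2, rng=None):
    """Conventional smoothing: vote over predictions on noisy copies of x.

    `classifier` is a hypothetical callable returning an integer class
    label for a single input; `sigma` is the noise standard deviation.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.zeros(num_classes, dtype=int)
    for _ in range(num_samples):
        noisy = x + sigma * rng.standard_normal(x.shape)
        counts[classifier(noisy)] += 1
    return int(np.argmax(counts))


def sample_wise_smoothed_predict(x, classifier, noise_level_selector,
                                 **kwargs):
    """Sample-wise variant: the noise level is chosen separately per input.

    `noise_level_selector` is a hypothetical callable mapping an input
    to a non-negative noise level; in this sketch it is simply given.
    """
    sigma = float(noise_level_selector(x))
    return smoothed_predict(x, classifier, sigma, **kwargs)
```

Passing a selector that returns a constant reproduces the conventional fixed-noise behaviour, which makes the two settings easy to compare on the same classifier.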
