Zilei Shao

Adversarial Tokenization

Mar 04, 2025

Renato Lui Geh, Zilei Shao, Guy Van den Broeck

Abstract: Current LLM pipelines account for only one possible tokenization for a given string, ignoring exponentially many alternative tokenizations during training and inference. For example, the standard Llama3 tokenization of penguin is [p,enguin], yet [peng,uin] is another perfectly valid alternative. In this paper, we show that despite LLMs being trained solely on one tokenization, they still retain semantic understanding of other tokenizations, raising questions about the implications for LLM safety. Put succinctly, we answer the following question: can we adversarially tokenize an obviously malicious string to evade safety and alignment restrictions? We show that not only is adversarial tokenization an effective yet previously neglected axis of attack, but it is also competitive against existing state-of-the-art adversarial approaches without changing the text of the harmful request. We empirically validate this exploit across three state-of-the-art LLMs and adversarial datasets, revealing a previously unknown vulnerability in subword models.
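To make the idea of alternative tokenizations concrete, here is a minimal sketch that enumerates every way to split a string into tokens drawn from a vocabulary. The function name `tokenizations` and the toy vocabulary are assumptions for illustration only: the tokens `pen` and `guin` are hypothetical, and a real Llama3 vocabulary is far larger. This is not the paper's attack, only the counting argument behind it.

```python
from typing import List, Set


def tokenizations(text: str, vocab: Set[str]) -> List[List[str]]:
    """Enumerate every split of `text` into tokens that all belong to `vocab`."""
    if not text:
        # The empty string has exactly one tokenization: no tokens at all.
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocab:
            # Extend every tokenization of the remainder with this prefix.
            for rest in tokenizations(text[end:], vocab):
                results.append([prefix] + rest)
    return results


# Hypothetical toy vocabulary (an assumption, not the real Llama3 vocabulary).
TOY_VOCAB = {"p", "enguin", "peng", "uin", "pen", "guin"}

segmentations = tokenizations("penguin", TOY_VOCAB)
for tokens in segmentations:
    print(tokens)
```

With this toy vocabulary the string already has several valid tokenizations. Without memoization the recursion can take exponential time on long strings, which mirrors the abstract's point that the number of alternatives grows exponentially.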

Bounding-Box Inference for Error-Aware Model-Based Reinforcement Learning

Jun 23, 2024

Erin J. Talvitie, Zilei Shao, Huiying Li, Jinghan Hu, Jacob Boerma, Rory Zhao, Xintong Wang

Abstract: In model-based reinforcement learning, simulated experiences from the learned model are often treated as equivalent to experience from the real environment. However, when the model is inaccurate, it can catastrophically interfere with policy learning. Alternatively, the agent might learn about the model's accuracy and selectively use it only when it can provide reliable predictions. We empirically explore model uncertainty measures for selective planning and show that best results require distribution-insensitive inference to estimate the uncertainty over model-based updates. To that end, we propose and evaluate bounding-box inference, which operates on bounding boxes around sets of possible states and other quantities. We find that bounding-box inference can reliably support effective selective planning.
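One way to picture reasoning over bounding boxes is interval arithmetic on a one-step update target. The sketch below is an illustrative assumption, not the paper's algorithm: the `Interval` class, the `td_target_bounds` function, and the numbers are all hypothetical. It propagates lower and upper bounds on a reward and a next-state value through the target r + γ·v, and treats the width of the resulting interval as a rough uncertainty signal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A one-dimensional bounding box [lo, hi] around a set of possible values."""
    lo: float
    hi: float

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def scale(self, c: float) -> "Interval":
        # Only non-negative factors are supported, so the bounds keep their order.
        if c < 0:
            raise ValueError("scale factor must be non-negative")
        return Interval(c * self.lo, c * self.hi)

    @property
    def width(self) -> float:
        return self.hi - self.lo


def td_target_bounds(reward: Interval, next_value: Interval, gamma: float) -> Interval:
    """Bound the target r + gamma * v given bounds on r and v (hypothetical helper)."""
    return reward + next_value.scale(gamma)


# Hypothetical numbers chosen for illustration only.
target = td_target_bounds(Interval(0.0, 1.0), Interval(2.0, 4.0), gamma=0.5)
print(target, target.width)
```

A wide interval suggests that the model's predictions leave the update poorly determined, so an agent could down-weight or skip that simulated update. How to act on the width is a design choice this sketch leaves open.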

* To appear: Reinforcement Learning Conference (RLC), 2024
