Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Samir Rajani

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Jan 31, 2025

Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil(+33 more)

Figure 1 for Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Figure 2 for Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Figure 3 for Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Figure 4 for Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Abstract:Large language models (LLMs) are vulnerable to universal jailbreaks-prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.

Via

Access Paper or Ask Questions

Adversarial Predictions of Data Distributions Across Federated Internet-of-Things Devices

Aug 28, 2023

Samir Rajani, Dario Dematties, Nathaniel Hudson, Kyle Chard, Nicola Ferrier, Rajesh Sankaran, Peter Beckman

Figure 1 for Adversarial Predictions of Data Distributions Across Federated Internet-of-Things Devices

Figure 2 for Adversarial Predictions of Data Distributions Across Federated Internet-of-Things Devices

Figure 3 for Adversarial Predictions of Data Distributions Across Federated Internet-of-Things Devices

Figure 4 for Adversarial Predictions of Data Distributions Across Federated Internet-of-Things Devices

Abstract:Federated learning (FL) is increasingly becoming the default approach for training machine learning models across decentralized Internet-of-Things (IoT) devices. A key advantage of FL is that no raw data are communicated across the network, providing an immediate layer of privacy. Despite this, recent works have demonstrated that data reconstruction can be done with the locally trained model updates which are communicated across the network. However, many of these works have limitations with regard to how the gradients are computed in backpropagation. In this work, we demonstrate that the model weights shared in FL can expose revealing information about the local data distributions of IoT devices. This leakage could expose sensitive information to malicious actors in a distributed system. We further discuss results which show that injecting noise into model weights is ineffective at preventing data leakage without seriously harming the global model accuracy.

* 6 pages, 6 figures, accepted for publication through 2023 IEEE World Forum on Internet of Things

Via

Access Paper or Ask Questions