Stephen Casper

Obfuscated Activations Bypass LLM Latent-Space Defenses

Dec 12, 2024

The Reality of AI and Biorisk

Dec 02, 2024

What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks

Nov 02, 2024

Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience

Aug 26, 2024

Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Jul 22, 2024

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Apr 15, 2024

The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability

Apr 03, 2024

Defending Against Unforeseen Failure Modes with Latent Adversarial Training

Mar 08, 2024

Eight Methods to Evaluate Robust Unlearning in LLMs

Feb 26, 2024

Rethinking Machine Unlearning for Large Language Models

Feb 15, 2024