Picture for Scott Emmons

Scott Emmons

Obfuscated Activations Bypass LLM Latent-Space Defenses

Add code
Dec 12, 2024
Viaarxiv icon

Will an AI with Private Information Allow Itself to Be Switched Off?

Add code
Nov 25, 2024
Figure 1 for Will an AI with Private Information Allow Itself to Be Switched Off?
Figure 2 for Will an AI with Private Information Allow Itself to Be Switched Off?
Figure 3 for Will an AI with Private Information Allow Itself to Be Switched Off?
Figure 4 for Will an AI with Private Information Allow Itself to Be Switched Off?
Viaarxiv icon

When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?

Add code
Jul 21, 2024
Figure 1 for When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
Figure 2 for When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
Figure 3 for When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
Figure 4 for When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
Viaarxiv icon

Evidence of Learned Look-Ahead in a Chess-Playing Neural Network

Add code
Jun 02, 2024
Figure 1 for Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
Figure 2 for Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
Figure 3 for Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
Figure 4 for Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
Viaarxiv icon

When Your AIs Deceive You: Challenges with Partial Observability of Human Evaluators in Reward Learning

Add code
Mar 03, 2024
Viaarxiv icon

Uncovering Latent Human Wellbeing in Language Model Embeddings

Add code
Feb 19, 2024
Viaarxiv icon

A StrongREJECT for Empty Jailbreaks

Add code
Feb 15, 2024
Viaarxiv icon

ALMANACS: A Simulatability Benchmark for Language Model Explainability

Add code
Dec 20, 2023
Viaarxiv icon

Image Hijacks: Adversarial Images can Control Generative Models at Runtime

Add code
Sep 18, 2023
Viaarxiv icon

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

Add code
Apr 06, 2023
Viaarxiv icon