Meg Tong

Auditing language models for hidden objectives

Mar 14, 2025

Forecasting Rare Language Model Behaviors

Feb 24, 2025

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Jan 31, 2025

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024

Steering Llama 2 via Contrastive Activation Addition

Dec 09, 2023

Towards Understanding Sycophancy in Language Models

Oct 27, 2023

The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"

Sep 22, 2023

Taken out of context: On measuring situational awareness in LLMs

Sep 01, 2023