Suraj Anand

Are PPO-ed Language Models Hackable?

May 28, 2024
Via arXiv

Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting

May 28, 2024
Via arXiv

Suppressing Pink Elephants with Direct Principle Feedback

Feb 13, 2024
Via arXiv