Joshua Clymer

Towards evaluations-based safety cases for AI scheming

Nov 07, 2024

Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language

Oct 03, 2024

Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

May 11, 2024

Safety Cases: How to Justify the Safety of Advanced AI Systems

Mar 18, 2024

Generalization Analogies: A Testbed for Generalizing AI Oversight to Hard-To-Measure Domains

Nov 19, 2023