
Senthooran Rajamanoharan

Dense SAE Latents Are Features, Not Bugs

Jun 18, 2025

Model Organisms for Emergent Misalignment

Jun 13, 2025

Convergent Linear Representations of Emergent Misalignment

Jun 13, 2025

Towards eliciting latent knowledge from LLMs with mechanistic interpretability

May 20, 2025

An Approach to Technical AGI Safety and Security

Apr 02, 2025

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

Mar 13, 2025

Are Sparse Autoencoders Useful? A Case Study in Sparse Probing

Feb 23, 2025

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

Nov 21, 2024

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Aug 09, 2024

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Jul 19, 2024
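
As a pointer to what the last paper is about: its sparse autoencoders replace the usual ReLU with a thresholded variant, where each latent only fires if its pre-activation exceeds a learned per-latent threshold. A minimal NumPy sketch of that activation (variable names are illustrative, not taken from the paper's code):

```python
import numpy as np

def jumprelu(z, theta):
    """JumpReLU activation: keep a latent's pre-activation only if it
    exceeds its per-latent threshold theta, otherwise output zero.
    Equivalent to z * H(z - theta), with H the Heaviside step function."""
    return np.where(z > theta, z, 0.0)

# Toy example: 4 latent pre-activations with per-latent thresholds.
z = np.array([0.2, 1.5, -0.3, 0.9])
theta = np.array([0.5, 0.5, 0.5, 1.0])
print(jumprelu(z, theta))  # [0.  1.5  0.  0.]
```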