Picture for Rajashree Agrawal

Rajashree Agrawal

Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration

Add code
Dec 04, 2024
Viaarxiv icon

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach

Add code
Dec 03, 2024
Viaarxiv icon

When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?

Add code
Jul 21, 2024
Figure 1 for When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
Figure 2 for When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
Figure 3 for When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
Figure 4 for When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
Viaarxiv icon

Compact Proofs of Model Performance via Mechanistic Interpretability

Add code
Jun 24, 2024
Viaarxiv icon

Provable Guarantees for Model Performance via Mechanistic Interpretability

Add code
Jun 18, 2024
Viaarxiv icon

Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

Add code
Apr 01, 2024
Viaarxiv icon