Euan Ong

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Jan 31, 2025
Via arXiv

Compact Proofs of Model Performance via Mechanistic Interpretability

Jun 24, 2024
Via arXiv

Provable Guarantees for Model Performance via Mechanistic Interpretability

Jun 18, 2024
Via arXiv

Successor Heads: Recurring, Interpretable Attention Heads In The Wild

Dec 14, 2023
Via arXiv

Image Hijacks: Adversarial Images can Control Generative Models at Runtime

Sep 18, 2023
Via arXiv

Learnable Commutative Monoids for Graph Neural Networks

Dec 16, 2022
Via arXiv