
Suraj Srinivas

Towards Unifying Interpretability and Control: Evaluation via Intervention

Nov 07, 2024
Via arXiv

Generalized Group Data Attribution

Oct 13, 2024
Via arXiv

How much can we forget about Data Contamination?

Oct 04, 2024
Via arXiv

All Roads Lead to Rome? Exploring Representational Similarities Between Latent Spaces of Generative Image Models

Jul 18, 2024
Via arXiv

Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)

Feb 16, 2024
Via arXiv

Certifying LLM Safety against Adversarial Prompting

Sep 06, 2023
Via arXiv

Verifiable Feature Attributions: A Bridge between Post Hoc Explainability and Inherent Interpretability

Jul 27, 2023
Via arXiv

Efficient Estimation of the Local Robustness of Machine Learning Models

Jul 26, 2023
Via arXiv

Consistent Explanations in the Face of Model Indeterminacy via Ensembling

Jun 13, 2023
Via arXiv

On Minimizing the Impact of Dataset Shifts on Actionable Explanations

Jun 11, 2023
Via arXiv