Himabindu Lakkaraju

Towards Unifying Interpretability and Control: Evaluation via Intervention

Nov 07, 2024
Via arXiv

Generalized Group Data Attribution

Oct 13, 2024
Via arXiv

Quantifying Generalization Complexity for Large Language Models

Oct 02, 2024
Via arXiv

Learning Recourse Costs from Pairwise Feature Comparisons

Sep 20, 2024
Via arXiv

Explaining the Model, Protecting Your Data: Revealing and Mitigating the Data Privacy Risks of Post-Hoc Model Explanations via Membership Inference

Jul 24, 2024
Via arXiv

All Roads Lead to Rome? Exploring Representational Similarities Between Latent Spaces of Generative Image Models

Jul 18, 2024
Via arXiv

Operationalizing the Blueprint for an AI Bill of Rights: Recommendations for Practitioners, Researchers, and Policy Makers

Jul 11, 2024
Via arXiv

On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models

Jun 15, 2024
Via arXiv

Interpretability Needs a New Paradigm

May 08, 2024
Via arXiv

More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness

Apr 29, 2024
Via arXiv