Been Kim

Getting aligned on representational alignment

Nov 02, 2023
Via arXiv

Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero

Oct 25, 2023
Via arXiv

State2Explanation: Concept-Based Explanations to Benefit Agent Learning and User Understanding

Sep 21, 2023
Via arXiv

Don't trust your eyes: on the reliability of feature visualizations

Jun 21, 2023
Via arXiv

Gaussian Process Probes (GPP) for Uncertainty-Aware Probing

May 29, 2023
Via arXiv

Model evaluation for extreme risks

May 24, 2023
Via arXiv

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models

Jan 10, 2023
Via arXiv

Impossibility Theorems for Feature Attribution

Dec 22, 2022
Via arXiv

On the Relationship Between Explanation and Prediction: A Causal View

Dec 20, 2022
Via arXiv

Post hoc Explanations may be Ineffective for Detecting Unknown Spurious Correlation

Dec 09, 2022
Via arXiv