Jacob Steinhardt

LatentQA: Teaching LLMs to Decode Activations Into Natural Language

Dec 11, 2024
Via arXiv

Extractive Structures Learned in Pretraining Enable Generalization on Finetuned Facts

Dec 05, 2024
Via arXiv

What Do Learning Dynamics Reveal About Generalization in LLM Reasoning?

Nov 12, 2024
Via arXiv

VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models

Oct 10, 2024
Via arXiv

Explaining Datasets in Words: Statistical Models with Natural Language Parameters

Sep 13, 2024
Via arXiv

Safety vs. Performance: How Multi-Objective Learning Reduces Barriers to Market Entry

Sep 05, 2024
Via arXiv

Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

Jun 28, 2024
Via arXiv

Monitoring Latent World States in Language Models with Propositional Probes

Jun 27, 2024
Via arXiv

Adversaries Can Misuse Combinations of Safe Models

Jun 20, 2024
Via arXiv

Interpreting the Second-Order Effects of Neurons in CLIP

Jun 06, 2024
Via arXiv