Zhengxuan Wu

Dancing in Chains: Reconciling Instruction Following and Faithfulness in Language Models

Jul 31, 2024
Via arXiv

ReFT: Representation Finetuning for Language Models

Apr 08, 2024
Via arXiv

Mapping the Increasing Use of LLMs in Scientific Papers

Apr 01, 2024
Via arXiv

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions

Mar 12, 2024
Via arXiv

In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation

Mar 12, 2024
Via arXiv

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations

Feb 27, 2024
Via arXiv

A Reply to Makelov et al.'s "Interpretability Illusion" Arguments

Jan 23, 2024
Via arXiv

Rigorously Assessing Natural Language Explanations of Neurons

Sep 19, 2023
Via arXiv

MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions

May 24, 2023
Via arXiv

Interpretability at Scale: Identifying Causal Mechanisms in Alpaca

May 15, 2023
Via arXiv