Julian Michael

Alignment faking in large language models
Dec 18, 2024

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples
Nov 12, 2024

Training Language Models to Win Debates with Self-Play Improves Judge Accuracy
Sep 25, 2024

Analyzing the Role of Semantic Representations in the Era of Large Language Models
May 02, 2024

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
Mar 08, 2024

The Case for Scalable, Data-Driven Theory: A Paradigm for Scientific Progress in NLP
Dec 01, 2023

GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Nov 20, 2023

Debate Helps Supervise Unreliable Experts
Nov 15, 2023

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
May 07, 2023

We're Afraid Language Models Aren't Modeling Ambiguity
Apr 27, 2023