
Paul Röttger

University of Oxford

Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation

Oct 04, 2024

Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models

Aug 08, 2024

Evidence of a log scaling law for political persuasion with large language models

Jun 20, 2024

From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets

Apr 27, 2024

Near to Mid-term Risks and Opportunities of Open Source Generative AI

Apr 25, 2024

The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models

Apr 24, 2024

Introducing v0.5 of the AI Safety Benchmark from MLCommons

Apr 18, 2024

Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think

Apr 12, 2024

SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety

Apr 08, 2024

Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset

Mar 28, 2024