Picture for Daniel Paleka

Daniel Paleka

Refusal in Language Models Is Mediated by a Single Direction

Add code
Jun 17, 2024
Viaarxiv icon

Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition

Add code
Jun 12, 2024
Figure 1 for Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
Figure 2 for Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
Figure 3 for Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
Figure 4 for Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
Viaarxiv icon

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Add code
Apr 15, 2024
Figure 1 for Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Figure 2 for Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Figure 3 for Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Figure 4 for Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Viaarxiv icon

ARB: Advanced Reasoning Benchmark for Large Language Models

Add code
Jul 28, 2023
Viaarxiv icon

Evaluating Superhuman Models with Consistency Checks

Add code
Jun 19, 2023
Viaarxiv icon

Poisoning Web-Scale Training Datasets is Practical

Add code
Feb 20, 2023
Viaarxiv icon

Red-Teaming the Stable Diffusion Safety Filter

Add code
Oct 11, 2022
Figure 1 for Red-Teaming the Stable Diffusion Safety Filter
Figure 2 for Red-Teaming the Stable Diffusion Safety Filter
Figure 3 for Red-Teaming the Stable Diffusion Safety Filter
Figure 4 for Red-Teaming the Stable Diffusion Safety Filter
Viaarxiv icon

A law of adversarial risk, interpolation, and label noise

Add code
Jul 08, 2022
Figure 1 for A law of adversarial risk, interpolation, and label noise
Figure 2 for A law of adversarial risk, interpolation, and label noise
Figure 3 for A law of adversarial risk, interpolation, and label noise
Figure 4 for A law of adversarial risk, interpolation, and label noise
Viaarxiv icon