Daniel Paleka

Large-scale online deanonymization with LLMs

Feb 18, 2026

Consistency Checks for Language Model Forecasters

Dec 24, 2024

Refusal in Language Models Is Mediated by a Single Direction

Jun 17, 2024

Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition

Jun 12, 2024

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Apr 15, 2024

ARB: Advanced Reasoning Benchmark for Large Language Models

Jul 28, 2023

Evaluating Superhuman Models with Consistency Checks

Jun 19, 2023

Poisoning Web-Scale Training Datasets is Practical

Feb 20, 2023

Red-Teaming the Stable Diffusion Safety Filter

Oct 11, 2022

A law of adversarial risk, interpolation, and label noise

Jul 08, 2022