
Alexander Pan

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

Jul 31, 2024

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Apr 15, 2024

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Mar 06, 2024

Feedback Loops With Language Models Drive In-Context Reward Hacking

Feb 09, 2024

Representation Engineering: A Top-Down Approach to AI Transparency

Oct 10, 2023

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

Apr 06, 2023

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

Jan 10, 2022

Improving Robustness of Reinforcement Learning for Power System Control with Adversarial Training

Oct 19, 2021