Alexander Pan

LatentQA: Teaching LLMs to Decode Activations Into Natural Language

Dec 11, 2024

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

Jul 31, 2024

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Apr 15, 2024

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Mar 06, 2024

Feedback Loops With Language Models Drive In-Context Reward Hacking

Feb 09, 2024

Representation Engineering: A Top-Down Approach to AI Transparency

Oct 10, 2023

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

Apr 06, 2023

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

Jan 10, 2022

Improving Robustness of Reinforcement Learning for Power System Control with Adversarial Training

Oct 19, 2021