Nathan Helm-Burger

Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models

Dec 02, 2024
Via arXiv

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Mar 06, 2024
Via arXiv

Will releasing the weights of future large language models grant widespread access to pandemic agents?

Nov 01, 2023
Via arXiv