Mantas Mazeika

Shammie

Tamper-Resistant Safeguards for Open-Weight LLMs

Aug 01, 2024
Via arXiv

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

Jul 31, 2024
Via arXiv

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Mar 06, 2024
Via arXiv

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Feb 06, 2024
Via arXiv

Representation Engineering: A Top-Down Approach to AI Transparency

Oct 10, 2023
Via arXiv

An Overview of Catastrophic AI Risks

Jul 11, 2023
Via arXiv

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Jun 20, 2023
Via arXiv

How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios

Oct 18, 2022
Via arXiv

Forecasting Future World Events with Neural Networks

Jun 30, 2022
Via arXiv

How to Steer Your Adversary: Targeted and Efficient Model Stealing Defenses with Gradient Redirection

Jun 28, 2022
Via arXiv