Jan Wehner

Probe-based Fine-tuning for Reducing Toxicity

Oct 24, 2025
Via arXiv

Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models

Feb 27, 2025
Via arXiv

Safety is Essential for Responsible Open-Ended Systems

Feb 06, 2025
Via arXiv

Representation noising effectively prevents harmful fine-tuning on LLMs

May 23, 2024
Via arXiv

Immunization against harmful fine-tuning attacks

Feb 26, 2024
Via arXiv

Explaining Learned Reward Functions with Counterfactual Trajectories

Feb 07, 2024
Via arXiv