
Zhenhong Zhou

On the Role of Attention Heads in Large Language Model Safety

Oct 17, 2024

Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions

Aug 14, 2024

Course-Correction: Safety Alignment Using Synthetic Preferences

Jul 23, 2024

How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States

Jun 09, 2024

Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue

Feb 27, 2024

Quantifying and Analyzing Entity-level Memorization in Large Language Models

Aug 30, 2023