Udari Madhushani Sehwag

AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment

Oct 15, 2024

Can LLMs be Scammed? A Baseline Measurement Study

Oct 14, 2024

GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment

Oct 10, 2024

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors

Jun 20, 2024