Lei Sha

Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues

Oct 14, 2024

BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models

Oct 13, 2024

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

Oct 10, 2024

Towards a Unified View of Preference Learning for Large Language Models: A Survey

Sep 04, 2024

HSF: Defending against Jailbreak Attacks with Hidden State Filtering

Aug 31, 2024

ATM: Adversarial Tuning Multi-agent System Makes a Robust Retrieval-Augmented Generator

May 28, 2024

Unlocking Multi-View Insights in Knowledge-Dense Retrieval-Augmented Generation

Apr 19, 2024

ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors

Feb 26, 2024

From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings

Feb 25, 2024

Harnessing the Plug-and-Play Controller by Prompting

Feb 06, 2024