Zhexin Zhang

Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

Jul 03, 2024

Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack

Jun 17, 2024

ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors

Feb 26, 2024

Unveiling the Implicit Toxicity in Large Language Models

Nov 29, 2023

Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization

Nov 15, 2023

SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions

Sep 13, 2023

Ethicist: Targeted Training Data Extraction Through Loss Smoothed Soft Prompting and Calibrated Confidence Estimation

Jul 10, 2023

Safety Assessment of Chinese Large Language Models

Apr 20, 2023

Recent Advances towards Safe, Responsible, and Moral Dialogue Systems: A Survey

Feb 18, 2023

MoralDial: A Framework to Train and Evaluate Moral Dialogue Systems via Constructing Moral Discussions

Dec 21, 2022