Ruixuan Huang

GuidedBench: Equipping Jailbreak Evaluation with Guidelines

Feb 24, 2025
Via arXiv

Evaluating Concept-based Explanations of Language Models: A Study on Faithfulness and Readability

Apr 30, 2024
Via arXiv

Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector

Apr 18, 2024
Via arXiv