Michael Backes

Understanding LLM Behavior When Encountering User-Supplied Harmful Content in Harmless Tasks

Mar 12, 2026

Real Money, Fake Models: Deceptive Model Claims in Shadow APIs

Mar 05, 2026

Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks

Mar 03, 2026

Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

Feb 09, 2026

Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?

Dec 30, 2025

JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring

Aug 28, 2025

Hate in Plain Sight: On the Risks of Moderating AI-Generated Hateful Illusions

Jul 30, 2025

Excessive Reasoning Attack on Reasoning LLMs

Jun 17, 2025

SoK: Data Reconstruction Attacks Against Machine Learning Models: Definition, Metrics, and Benchmark

Jun 09, 2025

The Challenge of Identifying the Origin of Black-Box Large Language Models

Mar 06, 2025