Fazl Barez

Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness

Mar 03, 2025
Via arXiv

Do Sparse Autoencoders Generalize? A Case Study of Answerability

Feb 27, 2025
Via arXiv

Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs

Feb 18, 2025
Via arXiv

Rethinking AI Cultural Evaluation

Jan 13, 2025
Via arXiv

Open Problems in Machine Unlearning for AI Safety

Jan 09, 2025
Via arXiv

Best-of-N Jailbreaking

Dec 04, 2024
Via arXiv

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach

Dec 03, 2024
Via arXiv

PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning

Oct 11, 2024
Via arXiv

Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models

Oct 09, 2024
Via arXiv

Towards Interpreting Visual Information Processing in Vision-Language Models

Oct 09, 2024
Via arXiv