Fazl Barez

Chain-of-Thought Hijacking

Oct 30, 2025

Rethinking Safety in LLM Fine-tuning: An Optimization Perspective

Aug 17, 2025

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Jul 03, 2025

The Singapore Consensus on Global AI Safety Research Priorities

Jun 25, 2025

Beyond Linear Steering: Unified Multi-Attribute Control for Language Models

May 30, 2025

Precise In-Parameter Concept Erasure in Large Language Models

May 28, 2025

SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors

May 20, 2025

Scaling sparse feature circuit finding for in-context learning

Apr 18, 2025

Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness

Mar 03, 2025

Do Sparse Autoencoders Generalize? A Case Study of Answerability

Feb 27, 2025