Picture for Fazl Barez

Fazl Barez

Rethinking AI Cultural Evaluation

Add code
Jan 13, 2025
Viaarxiv icon

Open Problems in Machine Unlearning for AI Safety

Add code
Jan 09, 2025
Viaarxiv icon

Best-of-N Jailbreaking

Add code
Dec 04, 2024
Viaarxiv icon

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach

Add code
Dec 03, 2024
Figure 1 for Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
Figure 2 for Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
Figure 3 for Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
Figure 4 for Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
Viaarxiv icon

PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning

Add code
Oct 11, 2024
Figure 1 for PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
Figure 2 for PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
Figure 3 for PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
Figure 4 for PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
Viaarxiv icon

Towards Interpreting Visual Information Processing in Vision-Language Models

Add code
Oct 09, 2024
Figure 1 for Towards Interpreting Visual Information Processing in Vision-Language Models
Figure 2 for Towards Interpreting Visual Information Processing in Vision-Language Models
Figure 3 for Towards Interpreting Visual Information Processing in Vision-Language Models
Figure 4 for Towards Interpreting Visual Information Processing in Vision-Language Models
Viaarxiv icon

Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models

Add code
Oct 09, 2024
Figure 1 for Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Figure 2 for Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Figure 3 for Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Figure 4 for Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Viaarxiv icon

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Add code
Jun 17, 2024
Viaarxiv icon

Risks and Opportunities of Open-Source Generative AI

Add code
May 14, 2024
Viaarxiv icon

Visualizing Neural Network Imagination

Add code
May 10, 2024
Figure 1 for Visualizing Neural Network Imagination
Figure 2 for Visualizing Neural Network Imagination
Figure 3 for Visualizing Neural Network Imagination
Figure 4 for Visualizing Neural Network Imagination
Viaarxiv icon