Picture for Fazl Barez

Fazl Barez

Best-of-N Jailbreaking

Add code
Dec 04, 2024
Viaarxiv icon

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach

Add code
Dec 03, 2024
Viaarxiv icon

PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning

Add code
Oct 11, 2024
Figure 1 for PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
Figure 2 for PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
Figure 3 for PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
Figure 4 for PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
Viaarxiv icon

Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models

Add code
Oct 09, 2024
Figure 1 for Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Figure 2 for Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Figure 3 for Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Figure 4 for Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Viaarxiv icon

Towards Interpreting Visual Information Processing in Vision-Language Models

Add code
Oct 09, 2024
Figure 1 for Towards Interpreting Visual Information Processing in Vision-Language Models
Figure 2 for Towards Interpreting Visual Information Processing in Vision-Language Models
Figure 3 for Towards Interpreting Visual Information Processing in Vision-Language Models
Figure 4 for Towards Interpreting Visual Information Processing in Vision-Language Models
Viaarxiv icon

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Add code
Jun 17, 2024
Viaarxiv icon

Risks and Opportunities of Open-Source Generative AI

Add code
May 14, 2024
Viaarxiv icon

Visualizing Neural Network Imagination

Add code
May 10, 2024
Figure 1 for Visualizing Neural Network Imagination
Figure 2 for Visualizing Neural Network Imagination
Figure 3 for Visualizing Neural Network Imagination
Figure 4 for Visualizing Neural Network Imagination
Viaarxiv icon

Near to Mid-term Risks and Opportunities of Open Source Generative AI

Add code
Apr 25, 2024
Figure 1 for Near to Mid-term Risks and Opportunities of Open Source Generative AI
Figure 2 for Near to Mid-term Risks and Opportunities of Open Source Generative AI
Figure 3 for Near to Mid-term Risks and Opportunities of Open Source Generative AI
Figure 4 for Near to Mid-term Risks and Opportunities of Open Source Generative AI
Viaarxiv icon

Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

Add code
Feb 23, 2024
Viaarxiv icon